# Scaling Laws for Language Transfer Learning | Christina Kim | OpenAI Scholars Demo Day 2021

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=lpe5Gwuqa-k
- **Date:** May 10, 2021
- **Duration:** 15:03
- **Views:** 2,705

## Description

Learn more: https://openai.com/blog/openai-scholars-2021-final-projects#christina

## Contents

### [0:00](https://www.youtube.com/watch?v=lpe5Gwuqa-k) Introduction

Hi everyone, I'm Christina Kim, and I'm really excited to present my Scholars project on scaling laws for language transfer learning. Throughout the OpenAI Scholars program I was really interested in questions around data: what characteristics and attributes does it have, and how do those impact model performance? So for my project, I looked at how the scaling laws look for pre-trained English language models as we transfer to other languages.

Historically, the advancement of deep learning capabilities has centered around three levers: better algorithms, faster and cheaper compute, and larger high-quality datasets. Given machine learning's potentially significant impact on society, deepening our general understanding of machine learning, and of how certain factors improve models, is critical for making better predictions of which capabilities are going to develop next and when. Further, the exploration of scaling-law evidence across these three factors has created a way to measure their impact as they interact with and limit each other.

My project's framework is inspired by the work on scaling laws published by OpenAI in the past year. Scaling laws predict machine learning performance as a function of model size, dataset size, and the amount of compute used for training. You can think of compute, dataset size, and model size as different limiting factors that you can change to get better performance. Recently, scaling relationships were also found for transfer learning from pre-trained English text models to Python.

Scaling laws for transfer are important because these relationships can help explain how to work in a limited-data regime. In an ideal world, you would have an infinite amount of data for your models to learn from, meaning you'd only be limited by the other two factors, compute and model size. But getting a large quantity of high-quality data is a non-trivial task, and it's often near impossible; as a result, most problems we want to study are actually in this low-data regime. Before the Scholars program I was a machine learning engineer, and I saw firsthand how costly it is, in both time and money, to get good-quality data. Evaluating these trade-offs is an important and practical question that many researchers and practitioners have to handle.
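As a point of reference (my own illustration, not a slide from the talk), the scaling laws from Kaplan et al. (2020) express loss as a power law in each limiting factor. A minimal sketch, using constants close to the values fitted in that paper, quoted from memory:

```python
# Power-law form of the scaling laws in "Scaling Laws for Neural Language
# Models" (Kaplan et al., 2020): loss falls as a power law in model size N
# (non-embedding parameters) and dataset size D (tokens), whichever binds.
# Constants are approximately the paper's fitted values, quoted from memory.
ALPHA_N, N_C = 0.076, 8.8e13  # exponent and scale for model size
ALPHA_D, D_C = 0.095, 5.4e13  # exponent and scale for dataset size

def loss_limited_by_model_size(n: float) -> float:
    """Loss when data and compute are effectively unlimited."""
    return (N_C / n) ** ALPHA_N

def loss_limited_by_data_size(d: float) -> float:
    """Loss when model size and compute are effectively unlimited."""
    return (D_C / d) ** ALPHA_D

# The two model-size endpoints used in this talk:
for n in (3.3e6, 124e6):
    print(f"N={n:.1e}: predicted loss ~ {loss_limited_by_model_size(n):.2f}")
```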

### [2:15](https://www.youtube.com/watch?v=lpe5Gwuqa-k&t=135s) Experiments

Building on the work from Scaling Laws for Transfer, my experiments try to answer the question: how much does pre-training actually help when we transfer across different languages, namely Chinese, Spanish, and German, and what does that look like as we vary the dataset size and model size?

For my experiments, I first had to pre-train English language models. I pre-trained decoder-only transformers ranging from 124 million non-embedding parameters down to my smallest model size of 3.3 million non-embedding parameters. I trained these all on OpenWebText2, an open-source version of the WebText corpus that was used to train GPT-2. I used the same hyperparameters as the original Scaling Laws for Neural Language Models paper, except with a 500-step warmup and a cosine decay to 10% of the max learning rate. The text was encoded with the same GPT-2 tokenizer, a byte-level byte-pair encoding with a vocab size of about 50,000, and all the models were trained to about 26 billion tokens. As you can see here, my models exhibit scaling laws similar to what was found in Scaling Laws for Neural Language Models, except the line isn't quite linear, which may indicate that my largest models are a bit undertrained.

After getting my pre-trained models, I set up my fine-tuning experiments. I wanted to focus on changing the number of tokens of data while holding performance, which in our case was cross-entropy loss, and model size constant. For these experiments, the dataset sizes span six orders of magnitude while the model sizes span two orders of magnitude, and I trained on three different languages: Chinese, Spanish, and German. For the Chinese dataset I used a dataset called CommunityQA, which is similar to the WebText corpus, and for German and Spanish I drew from OSCAR, a multilingual corpus obtained by language-classifying the Common Crawl corpus.

The thing I really wanted to measure in my experiments was the effective data transferred: what does that look like when we transfer from English text to Chinese, Spanish, and German text? On the slide, effective data transferred is measured as follows: this is the amount of fine-tuning data needed to get to this loss when we're using a pre-trained model, and the purple dotted line is the amount of additional data we would need to reach that same loss when training from scratch. It's important to note that the amount of data transferred from pre-training gets smaller as we increase the number of tokens in the fine-tuning dataset, and eventually, for this model, it converges around a dataset size of 10 million tokens.
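To make that quantity concrete: in the notation of the Scaling Laws for Transfer paper, the effective data transferred D_T is the extra data a from-scratch model would need to match the pre-trained model's fine-tuned loss, so the total effective data is D_E = D_F + D_T. Below is a minimal sketch of reading D_T off two measured loss curves; the interpolation scheme and all numbers are my own illustration, not values from the talk:

```python
import numpy as np

def effective_data_transferred(d_f, loss_finetuned, d_scratch, loss_scratch):
    """D_T = D_E - D_F: extra data a from-scratch model would need to
    match the loss a pre-trained model reaches with d_f fine-tuning tokens.

    d_f            -- fine-tuning tokens given to the pre-trained model
    loss_finetuned -- loss the pre-trained model reaches at d_f
    d_scratch, loss_scratch -- arrays tracing the from-scratch loss curve
    """
    # Interpolate the from-scratch curve in log-data space to find the
    # total effective data D_E at which from-scratch training matches the
    # fine-tuned loss. np.interp needs increasing x, so sort by loss.
    order = np.argsort(loss_scratch)
    log_d_e = np.interp(loss_finetuned,
                        loss_scratch[order], np.log(d_scratch)[order])
    return np.exp(log_d_e) - d_f

# Hypothetical curves purely for illustration:
d_scratch = np.array([8e3, 8e4, 8e5, 8e6, 8e7])
loss_scratch = np.array([6.1, 5.0, 4.1, 3.4, 2.9])
print(f"D_T ~ {effective_data_transferred(8e3, 4.5, d_scratch, loss_scratch):.3g}")
```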
So I want to show you what it looks like when we actually compare these three languages; this is the exciting bit. You can see that the pre-trained English models help the most when we're learning German, versus Spanish and Chinese, and that makes sense: I think these results reflect a lot about the linguistic similarities between English and these other languages. English and German are both derived from Proto-Germanic and are linguistically the most similar; although Spanish shares many of the same symbols as the English alphabet, it's actually in a different family of languages; and Chinese obviously has a very different writing system from the English alphabet and is very distinct.

Another thing I want to highlight is the shape of the lines and the distance between them. The effective data transferred for Spanish and Chinese is not too different at the initial point, a dataset size of 8,000 tokens; however, as we increase the dataset size, pre-training continues to help Spanish for roughly another order of magnitude compared to Chinese.

Another way to think about how much data from pre-training is actually useful is the fraction of effective data from fine-tuning: the smaller this fraction is, the more pre-training has helped us. As you can see in these graphs, as the model size increases this fraction decreases for all languages, which means pre-training has become more effective; but as we increase the dataset size, this fraction increases across model sizes, which means pre-training has become less effective. These results echo the points from the previous slide about how far apart these distributions may be from each other: the German graph has steeper curves compared to Spanish and Chinese, which I think indicates that more transfer is happening for German than for the other two languages.

Another interesting thing we found was that pre-training helps most in low-data regimes. In a low-data regime, pre-training is most helpful across model sizes, but especially at the smaller model sizes. You can see here that as I increase the model size, with a fixed dataset size of Chinese text to fine-tune on, the models trained from scratch on Chinese did not improve, while the models pre-trained on English continued to achieve better performance. The flat lines are where we're data-limited in this setup, versus where the slope starts to increase, where we're now parameter-limited. Another important thing to note is that using pre-trained models is far more compute-efficient than training from scratch, and you can see that here for this one model size and dataset size.
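For reference, the fraction plotted here is, in the transfer paper's notation, D_F / (D_F + D_T): the share of the total effective data that came from fine-tuning rather than pre-training. A quick illustration with made-up numbers:

```python
def finetune_fraction(d_f: float, d_t: float) -> float:
    """Fraction of effective data from fine-tuning, D_F / (D_F + D_T).
    Smaller values mean pre-training contributed more of the effective data."""
    return d_f / (d_f + d_t)

# Hypothetical: tiny fine-tuning set, large transfer -> pre-training dominates
print(finetune_fraction(8e3, 4e5))   # ~0.02
# Hypothetical: large fine-tuning set, transfer saturated -> little benefit
print(finetune_fraction(1e7, 4e5))   # ~0.96
```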

### [8:17](https://www.youtube.com/watch?v=lpe5Gwuqa-k&t=497s) Limitations

I want to talk about some limitations of my experiments. The first is that I used the same tokenizer for all languages. This is an issue because, as I mentioned before, the tokenizer has a 50k vocab size, while Chinese has over 50,000 characters in its writing system; that means a lot of the tokenization is probably quite inefficient, which could impact model performance quite a bit. For future work, you'd want to train your own tokenizers and then transfer learn from there (a sketch of training one appears at the end of this section). Another point is that, judging from my original pre-training plots, I could have pre-trained for longer; then I think I would have seen a more linear trend in the scaling laws for the OpenWebText2 models. I would also want to do a more thorough hyperparameter and learning-rate sweep, as I believe both of these limitations could lead to very different results; the numbers in the previous slides might have been quite different had I found the optimal learning rates for the different dataset sizes and model sizes. One other note is that my language datasets come from different sources, so this experiment would be more thorough if I had used the same dataset source for all three languages.

I also want to talk about some future work that I'm really excited about after this project. One thing that could be really interesting is to compare the effective data transferred when using pre-trained models of a different language transferring back to English; then you could maybe create some kind of mapping of how far apart distributions are from each other, ask whether there is some kind of symmetry in the data transfer, and see what that actually looks like. Another obvious next step would be to use this setup to work on low-resource languages, or on other tasks and distributions that are quite different from English. Another thing that would be very cool, based on this work, would be to predict the ideal ratio of pre-training to fine-tuning for any given problem under some compute budget. Finally, in the same experimental format, it would be interesting to study the forgetting problem in transfer learning and see what effective data transfer looks like as we approach that problem.
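On the tokenizer limitation above: one way to address it is to train a language-specific byte-level BPE tokenizer. Here is a minimal sketch with the Hugging Face `tokenizers` library; the library choice, file paths, and settings are my own assumptions, not from the talk:

```python
# Train a byte-level BPE tokenizer on a Chinese corpus instead of reusing
# the English GPT-2 vocabulary. Paths and corpus are hypothetical.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["chinese_corpus.txt"],  # hypothetical fine-tuning corpus
    vocab_size=50_000,             # same budget as the GPT-2 tokenizer
    min_frequency=2,
)
os.makedirs("zh_tokenizer", exist_ok=True)
tokenizer.save_model("zh_tokenizer")  # writes vocab.json and merges.txt
```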

### [10:39](https://www.youtube.com/watch?v=lpe5Gwuqa-k&t=639s) Questions

Before I answer questions, I want to give some thanks. I want to thank JT for sharing his wisdom with me throughout the program, keeping our project on track, and staying up late right now in Poland to hear this; my fellow Scholars, especially Danielle and Kudzo, for sharing compute with me; and everyone who gave me feedback throughout the process and the program, especially Danny. A shout-out to OpenAI for making all this possible.

Great, so now I'll answer some of the questions I have here. One question asks which model architecture was used for transfer learning across models, and which one was trained from scratch. The architecture I used is the same GPT-style transformer, a decoder-only transformer, in both cases.

Another question asks how I would extrapolate what kinds of gains from pre-training you'd get from models smaller or larger than the ones I trained, or from smaller and larger datasets. I think you would see trends similar to what we saw in my previous slides for the different dataset sizes, and the main takeaway is that if you have a large fine-tuning dataset, you're not going to get as many gains as you would with a much smaller fine-tuning dataset.

Another question asks how my setup relates to the Scaling Laws for Transfer paper by Danny Hernandez from earlier this year. A lot of my work is heavily inspired by Danny's experiments: I did the same type of experimentation, changing the dataset size while varying the model sizes and comparing the losses between them.

The next question asks whether I considered transfer between other types of languages, say programming languages. I'd say you should check out the Scaling Laws for Transfer paper, because it actually looks into how English transfers to Python.

I also got a question asking whether I got a chance to study performance on metrics other than loss. I didn't, but I'd be curious to see how you could characterize this on downstream tasks; I think that's a pretty big thing to look at for transfer learning in particular.

There's a question asking whether I'd like to use a different tokenizer in the future. Yes, definitely: I think using tokenizers trained on the specific languages would get you much better results, and therefore probably much cleaner graphs.

And a last question asks whether there was any reason I decided not to train models smaller than two million parameters. Not particularly; I just thought models much smaller than that would result in losses that weren't that interesting to look at, since they would become parameter-limited very quickly.

Awesome, I think that's all my time, so I'm going to pass it off to Danielle, who will be presenting her project.

---
*Source: https://ekstraktznaniy.ru/video/11583*