Stanford CS221 | Autumn 2025 | Lecture 17: Language Models

Stanford CS221 | Autumn 2025 | Lecture 17: Language Models

Stanford Online

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Оглавление (16 сегментов)

Segment 1 (00:00 - 05:00)

So today we'll talk about language models. This is one of the very few lectures that don't have uh executable format. So you don't have to trace through the code. Um we're not going to test you on it. You're just going to sit and enjoy. Okay. So we do know now that language models are everywhere. I'm sure all of you have used it chatbt. Um and then it's on your phone when you write type anything on your keyboard you see the autocomplete that's a tiny language model there um and some of you may have used cursor uh maybe as part of your homework uh you can see code completions that's also a language model running in the background but you may not have known is that uh these days language models are really everywhere and when I say that I mean that they can be found in robotics other uh all sorts of in interesting applications maybe even voice uh videos that you don't even think about. Okay. And these things are very powerful. Um that also means that they're also like getting industrialized. There are a lot of uh big companies trying to build these giant artifacts and ship it out to you because they can do a lot of stuff. So I'm going to show two examples here. Uh you have the uh Llama 3 models from Meta and Quinn 3 models from Alibaba from China. Um so we'll run through some you know examples just to show you like the scale of these uh language models we're dealing with today. So first of all you know to train a typical uh large language models today we have quen 3 is trained on 33 uh 36 trillion tokens. What does that mean? Roughly that is like 27 trillion words. And if you just you know write down assuming like maybe four bytes per word or something like that you get about 144 terabytes of data just raw text. And to think about that to say that on your laptop that's a lot of data. You have uh let's say you're assuming maybe 300 word uh per page then you can get about 90 billion uh sheets of paper and you stack them high. That's about 9,000 kilometers uh above the surface. And just to just for contrast, the thea space station is only 400 km about Earth and Earth only has about 6,000 something kilometers of radius. So it's a lot of data. And if you were to type this data out at 50 words per minute, you're going to have to type a million years. So I think it's really uh important to kind of just grasp the scale of the language models that we're working with today. Um and in let's say you know with llama 3 instead of 36 trillion tokens you have 16 trillion tokens and you're training let's say a 400 something billion parameter model. Okay so there's a nice uh rule that can estimate uh the amount of flops that you need for training such a model. Um that gives you around 3. 9 * 10 to 25 flops and which is really close to uh what has been reported by meta. And if you just take one of these uh GPUs, let's say 800s off the shelf, uh these are very expensive, and you divide that up, you'll get something like uh about 8 uh 880, 800 days, or if you have your one of your laptops, you're going to train for uh maybe uh 65 uh sorry, 650,000 years on your single MacBook. And so that works out to be about $42 million of cost if you assume uh you know $2 per hour per age 100 GPU for a single pre-training run of one of these like giant language models. Okay. So it makes sense to see you know Nvidia stock price go uh that high up and is very happy about all this development and this is not the end and people are still building out these data centers. They're they're trying to send at this point GPUs to sta to space. uh we have Google exploring space-based infrastructure for training language models. Same thing with Nvidia. And so like we don't know where this is going and things are like expanding really quickly. Like it doesn't seem to make sense. And you know given the current trend we might just have uh you space before you know Stanford wins in the big game which is next week. Okay. And then you have giant teams working on these language models. You have for example for quen 3 more than 150 contributors and for llama 3 and GP4 you have three pages of authors uh so that's probably more than a couple hundred people and all of this money uh time uh compute people just to create these giant matrices of numbers which I've learned about uh from uh I believe uh homework 2 and first couple lectures. So how does this all make sense right? I spent all this effort all this millions of dollars just to get you know numbers. So just to summarize this is kind of the uh the state of language models. We have a huge chunk of internet you train uh for many days for many dollars and you get like this just file

Segment 2 (05:00 - 10:00)

of numbers. Um and so for this lecture uh I just want to give you like a gentle introduction of like what this kind of like means like how do we make sense of this right so first of all we're going to talk about what exactly are language models um so what they like what they are fundamentally as objects and second why is it good idea to model language like there are many things you could model why is it specifically language and third things uh is that what makes language models work like what are some key ideas that people discover covered over the years that makes it a good idea to just um train these things and you know scale them up and finally I'm going to talk a little bit about where we are today uh you know like as you as a student taking this class you know what does this all mean to you okay so what exactly are language models so language models um uh you know by definition are models of language um but you're like Ken this is not helpful like what does that mean well for our purposes uh we say language is just a structure sequence of characters. Well, that's a very vague description. Um but I assure you that many uh people who work on language models today don't really know you know language more than that really. Um so language can be come in many forms there there's obviously you know English here's Tammo here's uh Python is the language uh that you work with day-to-day um you know even sign language is a language so you have characters you have grammar okay and vaguely speaking the structure from a language comes from two things uh there's the vocabulary which means what are the things that you can put into this language and second there's the grammar which is like the rules or conventions of how these characters can follow each other. Okay, so they roughly define uh what you could you know work within [snorts] a language and what you cannot work with within the language and to learn a model of a language that means learning the structure of this language you know understanding the vocabulary understanding grammar and try to like best predict it or produce it like if I have a language model it should be able to like tell me oh you know how um how likely or how possible are these sequence under this particular language. Okay, so here's an example. We'll start with some examples. With basic language modeling, you're basically dealing with how can I, you know, best to some extent complete a sentence or fill in a sentence. I have this sentence here. The stock market crashed and investors something. Okay, so you have prior so you've grown up, you speak language, you speak English, and you have a particular belief of what the uh blank space should be. And we can run through some examples. Um so the first example is that if you just put in um which means like desk in Chinese it doesn't make any sense because it's not a particular uh useful uh it's not a valid actually uh word in English it's a wrong vocab in this case and second you can say well the market crashed and the investor started golfing or okay going it's not the correct semantics it's also not the correct grammar but it's starting to get within the realms of language the English language you might say the stock market crash and the investors panicked which is a very plausible completion. Um but then there are many other plausible completions and one of them could be that you celebrate uh that the stock market crashed because you really want to buy stocks at a very cheaper price and in particular you want to know that depending on like how you kind of see the world. If you imagine yourself to be a language model, um what words that come next really kind of reflect um you know the kinds of uh words and beliefs that you kind of have seen before you know to prefer the word celebrated you know the model or maybe you would know something about the world to know that okay it's actually better to celebrate than to panic for example uh because panic is more common if you look at the English language okay so a language model and it's just [snorts] some object that you kind of like you know obtain quote unquote train to really model the language or predict the language. So if I have a language model it's just an object that takes in supposedly a uh prefix or any other forms of text and then complete uh the text. There are many ways we can do language modeling but for now we'll stick with kind of like completing uh sequentially. Okay. So now we'll talk a little bit about the tensor view of the language sinking. Okay. So let's uh to to kind of concretize uh this whole thing a little bit more. Suppose that we represent each word as a vector. You have um uh you have these word vectors that you play with these there's embeddings uh that you understand that uh you can represent a particular discrete word as a continuous vector. And suppose we have a vocabulary uh which is a basically a

Segment 3 (10:00 - 15:00)

string to index dictionary with only 10 words that looks something like this. Okay. So for the first word uh I only know these 10 words and then for the first word I assign an index of zero. The second one and so on. Like this is a simple Python dictionary. And if we go back to the previous case, the stock market crash and investors something then a language model basically say we take in these uh word ids or like input ids. So they we use the index to represent the words. So you have a again you have a vocabulary that kind of give you these indices and it has a shape. Okay, the shape maybe is this length of the sequence and then you will have these embeddings which is like one individual vectors to represent each of these words. Um and they will have a particular shape. So if you remember you're going to have let's say define a dimension let's say in this case four then I have a 6x4 matrix okay to represent the input sequence and then what happens is that you are just asking the language model to do a classification task essentially right I'm classifying what would be a good word to come next so in this case if I have a vocabulary size of 10 I would say that I would like to you know make the probability of the word panicked or celebrated to be higher but for words like uh investors like this uh word to be very low probability. Okay. So essentially you're just dealing with kind of like multiclass classification but for you know predicting what the next word is. Okay. In general uh obviously uh you can make the sequence arbitrary long as you can train it. So you can have a T-shaped vector and then you can make your embedding dimension arbitrary. You can make a D. So you have a T by T a T byD matrix and then you can have a vocab size of V. That means you can just define as many vocabs as you want and then you make it uh a Vclass classification task essentially. Right? So this is kind of the simple picture of language modeling. Now instead of saying we want to predict a uh word at the last position which is to you know complete the sentence you can also just make the language model predict at every position. So because you already have the input you might as well like try to make it do more stuff. So in this case you can say well instead of predicting uh only here I'm going to predict also at every position uh before. So I'm going to have uh t by v output tensor shape. So previously I have a V because I have like these probabilities for the vocap for the last word. But now I I'm going to have the probabilities um for the vocab over the vocap but for many words uh before it. Okay. So for the first um uh vector you're basically predicting what the probability is for uh the word after the. Okay. And then second uh this is the probability of the word after the stock. Okay. So you're going to have like a bunch of these um probabilities. Okay. And suppose you look at the last position and let's say you get uh a bunch of probabilities. So some of them are more likely than others. Um you can then what we call uh do what we call sampling. Let's say you take the most likely that would be a greedy decoding but you can also like sample from the distribution like you randomly pick according to the probabilities. Let's say in this case I pick the word panic because the language tells me that okay it's more likely uh that this sentence finishes with panic. Then what I can do is then take this word put it back to the sequence um again I have this shape tensor shape t by d t d t d tensor and then I'm going to do the prediction again like from there okay so I can again do the same thing where I take the prediction word put it back to input for a pass and get a you know another vector so you can keep continue doing this okay so this is what we call auto reggression uh or auto reggressive language modeling uh where you just predict these words one by after each other. Okay. Um you could also batch everything together. Okay. So there's no there's nothing saying because at the end of the language model is just dealing with matrices and tensors. So as long as you figure out your dimensions, you can just do many sequences at once. Okay. So instead of saying I'm going to have a T-shaped tensor for the one sentence, I'm going to have like batch of B. I have two sentences instead. And then I'm going to have like two uh two sets of like uh embedding vectors. So they're going to have the shape B by T by D. Okay. So B is a batch, T is the sequence length and D is the embedded dimension. And then you can do the same kind of like forward pass to uh produce um the shape here which is B by T by V. Okay. So this is a standard batching. You've

Segment 4 (15:00 - 20:00)

learned that from homework two. Okay. So now let's look at like how we can interpret a language model from a probabilistic point of view. So essentially um what you're learning this thing called language model is just a distribution over the sequences. Okay. So ideally if you have a very good language model it tells you how likely a particular sequence is. I say maybe the stock market crash and the investors panic it has a probability of 2%. Um, by the way, this 2% is over all the possible sequences, not just like this particular, you know, uh, completion of the word panic, but over things like I ate lunch today or I didn't, um, go to gym yesterday or something like that. And you can imagine that you can have different kinds of sequences that have different uh, probabilities. So, uh, maybe you wouldn't like to see the investors to go golfing. Um, so you want to have like lower probability of that. Okay. And with a joint probability like this, what you can do then is to unpack it uh with the chain rule which you have learned from your basian lectures. Um so if I take the sequence uh of the probability uh probability of the sequence the stock market crash and investor celebrated I can just unpack them kind of like word by word right I can say it's the same as the probability of the word the and then followed by the probability of the stock given market given the stock and so on right so this is like you kind of basically repeatedly apply the basu to like you know break up uh this first part and then condition on this part and then break up so on. And so you have a expression that looks like this which basically says uh at every position you can predict uh or you have a uh model of the probability of the word given all the words that come before it. Okay. And what that says is that you only need a model that given past context predict one word at a time. And again this is like what we call auto regression. Um in many uh cases there's there are many objectives of language modeling. You don't have to just do lang uh next word or next token prediction. Um but what I've been showing you is basically next word/token prediction. Basically you run a multiclass classification pro problem over the vocabulary and you do that sequentially word by word. Okay. So you have input which is batched sequences and then you have output of batch sequences with probabilities over the vocabulary and it learns basically this distribution and it turns out that this is uh the distribution that people sorry the objective that people use in practice for all of these uh language models that you see um at least for pre-training. Um so it's very rare to have see you know other people doing like other kinds of stuff and we'll talk about why this kind of makes sense intuitively. Another um objective would be masked language modeling um where instead of you predicting the next word you can predict uh you know some missing word in the middle. — Okay. So here's an example. If I wanted to fill in this gap of import something as numpy, you can, you know, fill in a lot of words. Um, but you may not want to fill in another language. Again, this is the wrong vocabulary. You may not want to uh fill in import, which is the wrong grammar. Um, you could fill in numpy given your past um experiences working with numpy and python. But you could also say, well, it's a valid thing to fill in jax. num pi which is like you know Jax's implementation of the same interface uh for doing um some kind of similar tensor operations but in Jax okay and this objective I would say isn't used as much anymore it was very popular uh or like quotequote u popularized by bird back in 2018 so um back then people like train these language models just using this objective and they found that you can uh pre-train quoteunquote like just train on them on a lot of text and they somehow just become uh very useful. Um but these days we don't use this uh objective as often I would say. So we've talked about what language models are precisely they're just like a distri model over uh over the distribution of sequences. Okay. But there are many ways we haven't talked about how we implement it and there are many ways you can implement such a language model. Um the very naive approach is that you just count the sequences. So imagine that you have uh like a bunch of internet documents. You just go through like a news website. You go through reddit. com. You go to like tour. com and then you just grab these text and then you compile all this text into like a giant repository. And then you're like I'm gonna go through all the text. I'm just gonna count let's say every sequence and then I'm gonna like uh

Segment 5 (20:00 - 25:00)

assign an empirical kind of like probability to it. That means like if I see um the sentence or the phrase the stock market uh had been showing signs of exhaustion and I'm going to like okay increase the count of that by one and then you can define over the entire data set of your uh pre-training or training data set and like you get like an empirical probability if something never happened then the probability is zero. Uh if it happened then um you'll be non zero. Um but obviously the problem is that some strings will never show up. Uh you don't know uh you can like vary the string that you show and you may you may for example play with this word and instead of and you put it to or and you realize of the documents that you search you just didn't appear at all. Um so this is not a very good approach. A slightly better approach um would be that you c you still count but by engrams. uh when we say engrams we just mean n consecutive words or n consecutive like chunks at a time. So these are called engram language models in the literature and uh and this kind of involves some approximation to the probabilities that we just shown. Okay. So for the same sequence we have the stock market crash and investors panic. Then in this is the original uh expansion of the probability instead of that we'll say we'll approximate it as um the word panic only [clears throat] depend on the previous two words. So I'm saying like I'm not going to care about the sequence uh you know the words before and investors. I'm going to say if I see and investors I'm just going to count what kind of words appear after. So this is obviously like a uh simplification uh by assuming that each word only depends on the previous n by one words n minus one words and this is similar to the marov assumption that you work with the bijian lectures except that now we don't say uh it's one but for n words okay so back to the count example then you can now see if I only care about engrams then I find that the stock market this phrase do appear uh many of times and I can start counting suppose I have I want to like figure out the uh probability of the word crashed I can say well uh crash given uh stock and market I can say well crashed given stock market appeared once here but then stock market uh show up here but then is not ending with crashed so there are two occurrences and I'll just say okay I'm going to assign probably one over two okay and then there's no more stock market here. No, here. So, they don't count towards my probability. So, you now be able basically this um essentially like a giant lookup table of uh probabilities. All right. Lastly, obviously um we're not here to count text. We're here to train uh cool AI models. And these days, when you want to train AI model, you use neural networks. And um there are plenty of resources. But he just you know this is a slide from Andre Copy that uh you would just have a language model that sorry a neural network that takes in a sequence but predict the next word and you can train it using gradient descent that you've learned from uh many lectures ago. Okay. And the view here is that um you're going to have uh these sequences. Let's say the stock market crash and investor something um same as before you have these uh tensors of this shape. You have embeddings of the shape and then you just predict the output uh probabilities. And because it's a multi it's basically a giant multiclass classification problem, you will have the ground truth precisely using your own input text. You can just shift them by one position. So to predict the next word of the you know previously it is uh stock. So you're going to just put stock as the label for the prediction. Same thing if you want to predict network next word after uh stock then by your very own sequence you know that the correct uh output is market. So you can just take a sequence shift it by one and use that as your own label. And so that's how you get the loss and you can like bra perate and get your gradients to train your model. cool. Okay, any questions so far before we kind of like you know go into the next stage? Like so far we just talked about what language models really are? They are some objects. They're modeling um distribution over sequences. They seem fine. They seem okay. But why is a big deal, right? So why is it a good idea to model language? Uh why not other things? Why is it that we are so obsessed with sequences? Well, you realize that many task uh in the real world are just language modeling. So many things that you do

Segment 6 (25:00 - 30:00)

day-to-day if you observe and like pay attention to very quick uh very closely uh is really just a sequence completion problem. So you might like write an email to the chie teaching team and ask for extension. So you'll say hi teaching team may I please get an extension for homework 9. I was something something. Okay. If you have a very uh good uh uh let's say next word distribution which is uh that means like you say the right words this may get you like uh extension but if you say something like oh I was like very bored then that won't give you an extension for example so it is a language modeling problem and you may think that uh you may also see that like uh writing code is a language modeling problem you have um an array you want to like finish um you know the the chunks of code that you haven't wrote um haven't written then you know that's a language model to predict the next token or word depending on what you have shown before. Okay. Um here's another example from a um friend of mine working on research project. You can be like your advisor send you some messages and you like to respond to them and based on what they have sent and what you kind of like know about yourself you can like start putting the next token prediction uh not like next token like uh next word distributions from there. you just like keep completing and that essentially is like language modeling problem. Um so you see like there's a lot of like tasks that are essentially uh solved if you can model your next word distribution really well. Okay. A second reason for why language modeling is a good idea is that it enables uh what we call multitask learning. So that means that if you train a language model on many data with the same objective which is the next token prediction or next word prediction you teach a lot uh you teach the model a lot about a lot of the tasks that you care about. So here's a Wikipedia article. Um, uh, you know, you can read it, you can like try to understand it, but if you get a language model to fit on this article, you realize that, you know, to do really well on this particular article, you have to force the model to memorize a lot of things. For example, if you see Allan uh Mat touring um you know the birthday and then you ask to complete this next token then you have to like kind of like force language model to okay to say over all the possible you know tokens and dates and you know numbers and so on I need to pick the right date. So you have to force it to memorize certain things. You're trying to like encode this knowledge into this uh let's say neuronet network um that you train for language model. And the same thing about other tokens you might uh have to learn about what Alan Turing is doing um where he was born and in what year he did what things. So these are all general like um facts um you kind of like have to induce like instill into the model. Okay. And if you apply the same objective and same model on different kinds of text, let's say math, then you learn about other things, not just like uh facts about the world. So for example, in this case, if you are given a sequence that start with an irreducible fraction is a fraction something something. Um so for example, in a complex plane, this expression equals something. Then you would know that you're telling the language model to kind of learn about math a little bit. You have to be able to predict the correct expression for this um uh for this uh you know expression that comes before and similarly you have to get the language model to produce a counter example while something is not a irreducible fraction right okay um and similarly if you give it a logic puzzle you can get the language model to see okay here's a complete problem setup you know given below there are three statements three conclusions followed by three conclusions and then you know uh who you know who is a uh dancer who will be an actor and then you know kind of like logical entailment of all these facts um and you ask the language model to predict next token then you're basically enforcing um it to uh kind of like reason based on what you see in the context. So to be able to correctly produce this answer C, the language model has to like perhaps you know think you know this is very disputable. Some people say maybe you can get the language models just to memorize the question. Um but there's like uh you know many uh ongoing effort and debates about like okay what is truly reasoning what is truly like intelligent if language models can answer all these questions correctly over a thousand of them or a million of them are they smart or are they just like seeing patterns you know somewhere okay but what's important here is that you're using the same objective next

Segment 7 (30:00 - 35:00)

word prediction and you're using the same kind of like language modeling formulation — [snorts] — And so we solved similar problems like this from your logic homework. You know, you can encode these statements into your first, you know, first order logic clauses and statements and then you can like apply, you know, all this the system that you've learned to produce a conclusion. This is a case where you realize that you can maybe get at the answer with a completely different approach. Here we're not saying we want to encode symbols. we're going to encode rules. All we're saying is just here's a giant pile of text, predict the next token. A good language model will simply have like a high probability of predict predicting the correct answer as the next word as opposed to saying like okay, I can explain what the uh you know the symbols are, the logic and there there's a graph you can draw. No, it's just like next word prediction. And so this phenomenon has uh well was really like popularized by um uh this GB2 paper from openAI 2019. So they say that language models are unsupervised multitask learner. And what they kind of say is just like we would like to you know move towards more general systems. We can just perform many tasks eventually without the need to manually create and label a training data set for each one. So that kind of like touches on this idea of if you just train on a lot of these documents, different kinds of data simply by next word prediction, you can like you know teach the system many tasks at once. Okay. The third uh I uh the reason why um it's good idea to model language is that language models apparently scale really well, right? So the idea is that if you have a lot of data and a lot of and a big model, what you can do is that you can fit all the task at once without any special models. Okay? So you might think I I need a model specifically for Python, I need a language model specifically for English. It turns out that if you have a loss of data and a giant model, you don't need to do that. And then the second point is that if you just keep increasing the data set size and then the model size and the number of amount of compute that you throw at it, the loss just keep going down. like there's no sign of like stopping. Maybe, you know, some people would say we're starting to hit sign, but um for practical purposes of um uh of today, like we don't really like hit a barrier like it's every time you double the compute, double the model size, you just expect this loss dropping. And so back in the day, you would have a separate language model for newer machine translation. You have a separate model for uh coding. So this is uh from OpenAI. This is a model uh they call Davinci codeex. Uh this is a special model they have for coding and they train it on a lot of coding data. Today you have a single model it's called GP 5. 1 and it does everything you know you can ask it to write jokes you ask it to um you know summarize text use tools and then do translation and using uh text summarization and all that stuff. um like people are already starting to replace old translation systems with these like single unified language models that does everything and it's faster too. Um so back in the day you have these models that are like you know maybe 380 million parameters which are which were considered big. You know these are still a lot of like gigabytes of memory but today you have these models that you know are really big. We don't know how exactly how big but we do know that they are probably a 100 times bigger and this 100 times bigger model outperforms another model that's a thousand times bigger. Okay, so these are really just scaling up um the language models on lots of text and then a lot of lots of task at once and you train them um if you train them correctly they just do everything for you. And so this idea of kind of like scaling um was touched on by this paper called um scaling loss for neural language models. And so in there what they said is that um if you just you know keep adding compute keep adding data set size and keep adding a number of parameters you get this like smooth keep decreasing ever decreasing test loss and um you know it just seems to work really well. Okay um which is very interesting. So this is from five years ago, six years ago and similarly uh de mind two years later you know um wrote did more study on this phenomenon like how do you scale uh these models and then they find out that if you let's say fix a compute budget of this many training flops and then you like value number of parameters um you get this different in uh points on the testing loss and then if you get this for every you know compute budget you can like just draw a line through them and you realize wow They just don't stop. They just keep going.

Segment 8 (35:00 - 40:00)

They keep going down. And this is kind of the plot that you kind of you produce just by scaling the flops and then uh scaling tokens. They can predict okay what's the optimal u model size for example for this flop budget and then what's the loss it's going to achieve and then it seems to uh support uh what people have found before about the scaling. Okay. So to summarize uh um why it's a good idea I give three reasons. There are many more other reasons why modeling language is a good idea. First is that many tasks are just language modeling. Everything that you do or most of the things that you do if you think about it including writing your exam on Wednesday is a language modeling problem, right? Um if you uh train a giant language models on this uh simple objective of next word prediction or next token prediction, you kind of force the language model to learn a lot about a lot of task at once. And the third uh reason is that they kind of scale really well. Is it the simple recipe and then it just works. You don't have to think about all the fancy stuff or other kinds of like uh techniques. Just one thing and do it really well and then you get uh good results at the end. is the reason. Okay. So, what makes language models work? So, we talked about what they are. What they are is a neuronet network or like even a giant counting table of sequences. We talked about why they're it's good idea to do language modeling. But language modeling is not exactly a new idea. It came like decades ago. People thought about why don't we just like model language. Um but why is it now like this decade or this last couple years that people realize okay it's a really good thing to do. Um we already touched on part of the reasons. So uh for example people realize that if you do scale um they they scale really well like unlike other things that people have done before um language models you know they can do a lot of things when you scale them up and then they realize that if you do next token prediction it's a very nice convenient multitask learning objective as opposed to you um let's say pick multiple data sets and then you kind of deal with the losses. There's you know that people used to do the idea of a meta learning for multitask learning. The idea is that you collect multiple task and then you collect train on each task and then like you observe the performance and then you update the model kind of like a meta gradient. We just update the using the performance of each task to update kind of your model in a particular way. Here there's none of that. It's just a s simple single objective. You train um on lots of text and it just works. There are many more ideas that make language models work. We'll touch on a few of them. This list won't be exhaustive because there are many things that go into kind of the innovations of language models in the last couple years. And here we just want to touch on like the key kind of like intuitions on uh what uh you know make them work. What are the key essential things? Um but you know some may argue there are things that I miss here. You can talk about distillation. could talk about data quality and many other things but you know we we'll go through them uh if we have time okay let's start with motor architecture so you probably have heard of this paper attention is all you need um it basically introduces architecture called transformer and it looks um somewhat scary you have a lot of blocks in it you have lots of arrows in it and then it's gotten so much attention pun not intended um that Google put up this uh a statement at the top of the paper. They don't usually do that. Uh if you have a paper, you just put it out. That's it. But right now, they decided to edit the paper and add this statement on top. You know, like wow, okay, you can use this uh architecture in the diagrams if you want to. Um so it's a very popular architecture is a transformer. Um we won't talk about transformer unfortunately because I think that there are many online resources you will learn about um like in many different classes 2 to4n 3 to6 uh 231N and so on and I just want to talk to mo you mostly about like why architectures would matter like if you have a bad architecture what kind of happens and why what would a good architecture do to fix it right so they matter in the sense that they shouldn't be a bottleneck to your language modeling problem okay if I have a giant data set and have a single simple objective. Let's just pick the architecture that's the most scalable that doesn't create problems. Um, okay. So, we'll gain some intuitions by uh looking at what could fail if you just use your multi-layered perceptron NLP from homework 2 to the language modeling problem. And then you should go and maybe look up transformer and try to read about how it has a different architecture and that makes it a little bit better and quiz yourself like how it fixes the problems that we're going to show you.

Segment 9 (40:00 - 45:00)

Okay. So if we were to let's say do a simple uh MLP for this uh problem uh again we have this shape tensor which is the input ids. We're going to have the uh T byD tensor which is the uh embeddings of the vectors of the of words. Um you can just why don't you just put it put MLP in the middle right? You can you can you can imagine that I can just transform you know all this as a giant flat vector and then into this uh giant flat vector of t by v um number of elements. Well, the first problem you're going to see is that your parameter count kind of depends very strongly on the sequence length and the vocabulary size, right? So suppose I want to make this uh model a little bit bigger to handle 2T uh tensors. Okay, so that means like 2T uh words in the input sequence. then you intuitively have to like increase more and more like neurons in this NLP and then that just directly adds the number of weights that you have to uh add to the model. Okay. So even with just with imagine you're doing like simple one layer um classification problem then if you take the vectors as like a fact flat vector then you have t * d * t * v which is on the order of o dvt ^ 2 parameters okay in practice we want really large sequence length so we like imagine when you use hashb you throw in a lot of stuff in it you don't just want like t equals to like 10 or something you give you a you know homework and ask solve it for you. So that takes like a thousand words. Um and in practice you also want large vocabulary right? So you want to uh you want a model to know about a lot of task lot of languages not just English you want to know about like uh coding languages um Chinese uh you know Hindi and all that um so it doesn't scale really well with a long sequence of big vocap in transformers uh what happens is that uh you they design architecture such that the parameters don't replicate across the dimens engines like it doesn't mostly replicate. So you have the same set of parameters that kind of like work for every position. Um so these are the projections like the QKV that you're going to see. Um they work for every uh sequence position and then there's only a weak dependence on the vocap size. So you would only like change the uh embedding initial embedding matrix and the output kind of like multi multiclass classification layer. But everything in between doesn't really like depend on the vocap size. it's just like a vector of uh the word and then you kind of do something to it using attention mechanism and so on. The second problem is that uh this network is fixed and there's no dynamic weights. So what the NLP is essentially encoding like a fixed giant lookup table. That means for this particular prefix this is the next word or this is the distribution of the next word. Um and with a frozen model if I train an op and don't touch it anymore the weights don't have preference for any for different like positions. What does that mean? That means like suppose you go through uh let's say you pick a sequence length and you go through your um many documents and then you know for different documents you know for this particular sequence the first position might be the most important for that sequence the last and so on. So over like let's say uniformly uh random documents you find out you don't really get the NLP to learn about okay what are different positions dependent depending on the input. So you just like for any input I just have this set of weights. It's fixed. In transformers even a frozen model can like create these like quoteunquote dynamic weights with attention mechanism for deciding which position in the input sequence is more important. And that's kind of like a very uh essential and crucial idea that makes them work. Right? So we're saying if you have um something like a rare word in your sequence, the model can learn to like attend to that word more so than the other like feeler words like and the and but like that kind of words. Um but NLP does not do that for you natively. And the third problem is that if you just do NLP, uh there's no computation reuse. So at best we can just like you know remember all the sequences that we've seen and then uh cache the output of it for a particular sequence of a particular length right let's say I've seen a sequence of stock markets crash and investors something then you know okay I can remember it but if I change let's say this word and to or um well you know you have to redo the entire for pass because you cannot like remember just these you know activations because these activations say if this one change

Segment 10 (45:00 - 50:00)

this also affects this one. So it doesn't help for you to like remember a particular um position. So they all like mingle together uh with this like giant um uh linear layer um uh in contrast transformers have set it up such that there are nice uh techniques that you can bake in to like reuse a lot of computation. So for example, if you change this word to from and to or there are ways for you to like just remember what you computed so far uh from this sequence. Okay. So you can reuse a lot of computation. So basically it's designed in a way that it doesn't um it allows you to uh reuse computation more so than NLPs. Okay, cool. Again, I have not thought you told you what the transformer is. I'm just throwing out keywords. Uh hopefully you can look into the architecture a bit more and understand how they work. Okay, the next kind of like key idea that makes uh makes language models work is this idea of pre-training versus post- trainining. Um so previously we let's say in your homework um you if this case of training there's no pre-ost training here data set you train it that's it. Okay. Um but when you scale it up um to like really big models um what you kind of like you have like a natural separation of the two stages. So in pre-training the idea is that you want to train like a giant uh model on a giant amount of text with a simple objective which we talk about as next token prediction. Um here is here the idea is that you want to make the model generally knowledgeable about many things at once. So this is the multitask learning stuff we talked about right um in this phase you don't care about what the language model is going to do. You just wanted to model whatever seen it sees very uh nicely. And if you do that really well, you do get like somewhat of an intelligent model at the end. Um, uh, because you can like complete the tokens across many tasks. And here's a table from a open source language model called OMO. It tells you what the training um data is uh, and what they used. Um so here if you look at the first um like the category it's something called DCLM baseline which is data a set of like web pages and people have kind of like improve them uh over many techniques we'll talk about in a moment but intuitively it's just mostly web pages that you found on internet okay here they have about three trillion tokens of it um and three billion documents um and a lot of words And here's an example of what you can find in such a you know giant data set. Here's here it says meta discuss the workings and policies of the site about us learn more stack overflow the company. So as you read more you realize okay this is just some text that you crawl from Stack Overflow and then in particular it's a templated text. It's the things that you will see, you know, pretty often um if you just go through different pages of Stack Overflow. Um you know, you could like train a language models on this, but it may not be like that useful. But this is one of the pre-training documents that you might see. And here's another, you know, random document um that you could find. It's a what is it even saying? It's probably some form of a news article website. You know, it says New York Times here. It says, you know, navigation next like what's next and homepage. And these are like the categories. So again, this is like another document you might just crawl off from the internet. And here's another one. Um, you know, you just see I don't know what this is, but like some some stuff that text that you see like it's you can see that it's maybe produced by humans. It's somewhat legible. Um, it's um, you know, you could like learn and fit the language model on it. Um, here's another one, but this one is more useful here. This is like someone writing a question about how they could set up time machine question mark. Um, and then here are the commands they run. So this is like you know to some extent high quality. Okay. It tells you know what are the words that you might want to come uh come up with after saying I want to or what words you might say after saying um googling. You know here's another example. So um there are so in the other in more advanced class on language models CS36 you know here's a link to that to show you even more examples just like randomly sample text from the internet and you realize um they really like are very diverse. they they you'll be surprised how many stuff is on the internet. Um because we you know as students we typically spend time on this

Segment 11 (50:00 - 55:00)

sites and this site and that site but there's so many things out there uh that doesn't um uh you know intuitively feel like oh it's on the internet. So what I've shown you is kind of already quoteunquote high quality data among all the things that you could find on the internet. So what I showing you is this data set called um DCLM baseline and it it's like 1. 4% 4% of you know uh this pool of data from the Common Crow and Common Crow is just this project that writes a crawler that you know go to the web pages and go to the links follow them save every page that it sees into like a giant data set and only we say maybe 1% of them is is good uh to produce this baseline data set and to cut that down from you know that many documents to 1. 4% you do many things like well we want to filter out the URL we want to keep only the English we want to like you know dduplicate and then we want to um you know even apply another language model let's uh to like look at whether it's high quality text or not. Okay so uh it's hard to imagine personally for me how much like bad data is in this like stuff. So um there's a lot of data curation that happens for pre-training and the nice thing about pre-training uh and this is a key paper that we will talk about it's called uh language models are few shot learners or GP3 it basically says that if you do have a good data set and you do train a very large model on a lot of data uh they have this capability called fuelshot in context learning and this is from also 2020 20 many years ago. Um it basically says suppose you just you do nothing. You do absolutely nothing to make the models um you know specialize in any tasks. You just say here's the internet here. Here we're going to filter it. Let's train it and see what happens if you throw them on benchmark problems such as like multiple choice questions, summarization questions and see what happens there. And what they observe is that as you increase the parameters in the language model, um the average performance across 42 benchmarks just keeps going up. Like you didn't do anything to it. You just make the models larger. You just train on more data. Um there's no task specific training. There's no like okay here's a chemistry problem. Please learn about chemistry. No, just internet data. Okay, this is um you know tamal from pre-training and then um there's another capability called fshot learning which is that if you train such a model to a very big size and a very big uh data set you can just like write a input sequence write a perfect sequence the sequence looks like some form of instruction okay let's say you want to translate from English to French and then you give it some examples such as like see otter you know I can't speak French but that's I Hopefully that means C order. Um [snorts] and then you just kind of like provide examples and then you provide uh this blank for it to fill in. So if you remember a language model is just predicting the um missing token, right? So this is a perfectly valid input sequence. And what it what and what happens is that if you do this and if you put examples in the input sequence, it just knows how to do it by looking at the examples because you have to learn how to you know maybe examples of English and French text in the training data and you kind of like pattern match um what you have seen and pattern match that you're trying to formulate a task and then you just like does it finishes for you. So people call this in context learning or uh with field shot prompting. Um that means simply showing the examples of a task as part of the input string makes the completions better. And that's um very surprising like if you like think really like think about it like why would it work at all that you just train a language model and it just does the task and it is not just like this translation task. There's many tasks that um you can do um but I'm show only showing one. Okay. So that's pre-training. Yeah. Again, here we're just like producing like a giant model on a single objective with a giant text. Okay, post training. Post training is where you want to like start making the model more useful like chat GBT. So you now have a big base like this base model that does a lot of things. It knows a lot of things from the next token pred prediction objective. Um but it's a basically still an autocomplete on steroids, right? it it it just follows the patterns that it sees from the text. Um but it isn't useful because um if you have autocomplete, it doesn't

Segment 12 (55:00 - 60:00)

know that whatever you want to give it to it, it's a task or is it like a problem, it's a coding problem. It doesn't know any of that. It just knows that I want to complete the next token to the best of my knowledge or to the best pattern that matches the pushing data sets. So here's an example. uh if you throw the prompt or the input string explain the moon landing to a six-year-old in a few sentences, what happens is that GBD3 would uh create more questions like that uh as opposed to answering uh the input uh sentence. And if you examine this and you think about, okay, what might happen in the preaching data set, you can imagine that there could be this uh websites that just list a lot of questions that you want to like quiz um a student or you can they can be like a lot of like these questions that are collected in an exam or something, right? That's the distribution of natural language that you find in a training set. The answer to this question like uh answering why like how the moon landing works may not be as common as like just listing more questions in in the you know in the data set right so imagine if you go online how often is it that you just find nicely formatted question and answers as opposed to like a bunch of questions clustered together um more like the latter right okay so post- training is the process of making language models more useful such as following instructions So uh instruction following as I was saying this is one of the uh key capabilities we want to induce to a model. So here we just want to make the completions treat the input string as a question or an instruction and this is another key paper we can talk about this is from 2022 also openai. Um so in this paper they kind of introduced uh this idea of an ROF. Um the idea there is that uh you can just uh collect bunch of data sets that are like question and answers. You can first just make the model train on that directly. So instead of like um next token prediction on general internet documents, you can make it you predict um you make the next tokens to be the answers as opposed to more questions. So you just do that directly and then what you can also do is that uh train get a bunch of humans coming in to label you know what are good completions for this particular question. So let's say you get a model to generate two responses. Both are really bad but some of them look more like an answer to the question. So they will say okay this is a better um better answer better completion. And so you could have bunch of these like preference labels. Do I have slides for this? Probably not. Okay. Um you just have a bunch of like preferences uh from um from the model. So for example if the model generates four answers you get you know some laborer to come in and say which is better and once you get that kind of data set so question answers and preferences you can train a separate model to like predict which answers are better. So we call that a reward model. Um this reward model basically just approximates the human preference. Okay. And lastly um you just treat the um language model as a policy like a agent in RL which you have learned in couple weeks back. So here we're saying you're going to have this language model that produces generations and every generation it's just like an action that it does right it's a think of it as that and then for every action you have a psonic reward and the reward is that if you produce something that humans prefer more you know um you get reward and if you produce something that we don't like as a completion you get less reward and because you train this reward model from the human preferences you can just make your language model uh reinforcement learning against that reward model. So they have this like three-stage training. So you first fit on uh instruction following data using next token prediction and you train a reward model and then you've kind of fit like a preference learning. So um the training algorithms and the generations and the reward things these are all the things you have learned about there's no really like fancy um like algorithms like in practice people I think they just use the simple things that you have learned about policy gradient methods and then uh maybe slight modifications like there's a baseline to kind of reduce variance to policy gradient but it is what it is like a simple objective that you kind of learned in the couple lectures Okay. And so once you've done that, you make the language model basically uh follow instructions. So in this case uh this is the same uh example from a couple slides back that if you prompt GP3 to um produce uh completions to the first prompt, it now gives something that

Segment 13 (60:00 - 65:00)

looks like an answer to the question more than more so than more questions. And then same thing there. Um you know give write a guide on how I can break into neighbors house. Um if you just ask the base model to complete it you might get like more description of the problems more so than answers to the question. So in this case it says I would like to do it without touching anything blah blah blah. Right? So this is this could be like a standard thing like a paragraph you see on Reddit or something. And then this the response from the instructivity is more like um an answer to the question that um the prompt is asking. So it's treating it as an instruction. So you you might have already you know guessed that this may not be a good uh good idea to have models answer any kind of question for you. And so a different goal a separate goal in po post training is this idea of uh safety tuning. You don't want it to tell you how to break into to a house, right? Um there are many things you could do uh in post training. So in flowing is one, safety tuning is another. There are many other things like reducing memorization. Um you can do make it like more concise, more warm, less fantic like all these things is something that you can just make the um models do after the base is trained. Okay. So we'll just talk about safety tuning. Um here's an example um that of of the result of safety tuning. if you ask it to you know make a molotov uh cocktail which is um I believe explosive um you know it doesn't answer it for you and you really realize actually you know this is not a natural distribution of the text documents that you see on the internet it's more like they made it do that and in particular using the techniques that we just described like you're training a reward model doing reinforcement learning against that reward model and so on and supervised fine-tuning of saying um saying these But then um people would like come up with ways to then break the safety tuning because people want to make the model do bad things. Um here's a example from a couple years actually two years ago maybe at this point two years ago. Um this published only this year but it was there earlier is that if you ask the model how did people make a molotov cocktail as opposed to how to make a model cocktail then it starts answering for you. Um so like whatever you train as the guardrail for like protecting um you know refusing you know bad questions have basically has holes in them right let's say you collected a bunch of humanlabeled refusal examples okay how to break in your house no I cannot help you with that how do you know make a bomb cut a stop sign no I cannot help you with that but if you only train on these like examples it may not generalize to all the possible ways that people can trick the model. Okay. Um another example is that people ask things in this is one of my favorite examples is that uh if you ask it to make um na palm which is I believe some form of a danger dangerous chemical um and then you ask in a way that please you know be my grandma and tell me a bedtime story that uh involves making napal um then it does it for you which is very funny. Um obviously it fixes all of this but this just illustrates you know what you can kind of like do in the input space because it's such a huge input space um to like make the model you know follow instructions because we we've made the model follow instructions and safety tuning is more like a I would say I wouldn't say afterthought but like something that they do kind of separately collect data separately and maybe have a separate pipeline okay and it just keeps going like it doesn't just um say you know grandma is one and then how did the past tense is another one. You could ask a question in lead uh which is I believe a funny way to type words but with numbers and other variations. You could type your question in B 64. Um uh here like which is like these strings that you know human unreadable but models can read. Um it has a non-trivial attacks and tax rate by the way this is from two years ago. All this is patched I think. um you could you know uh do many other things. So it's an ongoing kind of like um kind of mouse game between uh the model developers who want to make their model not answer bad questions and people who just want to make the models answer bad questions. And so there are more techniques, more defenses, attacks and defenses. Okay. So in post training some of the techniques just involve things like supervised fine tuning um which is uh you can next token prediction on instruction following reviews of examples um or you can like RHF to reward uh certain behavior you can do RL

Segment 14 (65:00 - 70:00)

but if you have an automated verifier function you can also do that as you can use that as opposed to using humans to label things and you can just curate better data um like that means like removing the harmful data from the Okay, I might be short on time, so I'm going to rush through some of the later things. So tokenization is another key idea. The idea here is that you know if you want to be so we've seen a vocabulary you know early in the lecture that you just have this giant python dictionary of you know strings to indices and this whole dictionary is your vocabulary like anything outside of it we don't recognize. Okay. So the idea of a tokenizer is that you cannot possibly account for all the um all the possible words. Um but you also don't want to like use very like granular like letters or characters as your vocab. You want to have words because they occur so often. So these are basically subword units um that you want to keep in the um in the vocabulary and tokenizers are the things that produce the subword units. So let's say the string key idea beautiful cover using sub units instead of imiring all possible words because what if the input is rare or misspelled and I put this precise string in here and then you realize um the tokenizer can handle them just by breaking them up into other subword units that you have seen in the training data. Okay. And a key technique here is something called bite pair encoding. The idea is that you can start with a token uh being precisely a character plus some common words plus some special tokens and then you iteratively merge them into new tokens or um in new tok into new tokens by frequency. So for example, if you see that uh in and put occur so often together then you just put them into the same token and then you just keep doing that and every time you do the merge you add one uh new word to your vocabulary. Um and so you keep adding keep adding and you just stop until you like let's say reach a certain size of your vocabulary. Um I can't do justice of tokenization because it's a huge problem. It's both like solves a lot of problems. It also creates a lot of problems and I encourage you to take a look at this lecture um which goes into much more detail. Um for example a lot of the problems that we see with uh language models today is because of tokenization but then we cannot train them without them at this point at least. So it's like a very you know like you're boring you're almost like boring um boring as like one piece of a wall to another piece of wall and you fix the problem later on. Okay. Um systems I'll briefly talk about that. Um so here's an example just to illustrate why systems matter. Um so because for most people for a lot of when I was like learning uh language models and machine learning system is really an afterthought. I don't think about oh there's this GPU there's memory there's you know like flops I have to think about there's shorting um to give you kind of like intuition um suppose I have this um 70 billion parameter model the weights alone already takes 100 like with let's say with two bytes per parameter that already takes 140 GB of memory like I don't believe my laptop or any of your laptop have that much memory maybe some of you do but that's like a huge model and we haven't even started training it. It's just like you know uh it's just the weights when you train it. There's activations, there's gradients, there's the optimizer states, the atom first and second moments and this like creates maybe like eight time eight times more memory. So that's like 1. 12 terabytes of memory. And this is only a 7 billion parameter model. You do have uh trillions of parameters models these days. And a single H00 GPU only has 80 GB of memory. Like how does that make sense? like how do I even like it doesn't add up right um so I just want to touch on a few key ideas that uh people do the first idea is that you can quantize your models because the models are again at the end of the day just giant matrices of numbers you can just say why don't I represent these floatingoint numbers um you using lower precision I'm saying let's say if I have four bit quantization um or two bitization I'm saying there's only two to the two which means four possible values of the weights so I have this like spectrum of uh possible values. Um I'm just going to fit them into the bins like with two bits I'm going to have like four bins and then I'm going to fit them into the bins let's say using some nearest uh low loss metric and then I'm going to uh just like run these weights to the nearest bin. So that can reduce the amount of like uh bits you need for per parameter. So um you know so that's roughly how it works and uh I believe today you can go as low as two bit for inference and four bit for training but this is really still active research

Segment 15 (70:00 - 75:00)

people don't really know I don't think people have really figured out how to do 4bit training like large scale pre-training um and for two bit inference you already really get like bad performance uh like you lose a lot um but you know this is something you can do uh second uh idea is that you can just cut things you know shard things parallelism and shorting. So suppose if you have a model that can fit on one GPU um then you can just like split the data into multiple GPUs have them process the data separately and then merge the results. Okay, so that's basic kind of like data parallelism like just shing your data. But if your model cannot fit in a single GPU, then you might do you might cut the model instead. There are many different ways to cut the models. And here I'm just going to show you some quick examples. You can be like I will cut the layers of the model into different GPUs and then I will have the first GPU produce the input and the output for a pass, send that to the next GPU and then it doesn't. is very slow as you can imagine because when the second GPU is doing the computation the first GPU might be just waiting uh you know doing nothing. You can also cut the matrices um because uh you know if you think about like how matrices multiplication work you can cut them and then split them into separate matrices multiplication of smaller matrices and then combine results after that. Right? So this is a illustrations of how you could cut them maybe columnwise versus row-wise to get the same result. Um and then you can just put um the matrices on different GPUs. And then next idea is just to be more hardware aware and write better code. Okay, like there's what is being what is the bottleneck? Usually it's the memory bandwidth. So you don't want to like move things between the memory and the actual GPU very often. So you should ship all the stuff you need to GPU at once. It does everything at once, send it back. Okay, so kernel fusion is this idea of combining many small operations that go between memory and compute into one big operation. U flash attention is an example of that. Um okay, I'll skip that. So there many things I didn't talk about that make language models work things like test time scaling like if you make a language model say many things before it produces answer somehow it does better. You know it's very interesting behavior. Um it's like how humans think. If I ask you to predict the token right away, the next word right away, you don't get a good answer. But if you can just get a scratch pad to write your thoughts down, you know, you can do better on your exam. There's distillation um uh which is like if you train a big model, there's a small model, you can make the small model behave like a the big model just by mimicking the big model. Um you can use tools, get language models, use tools. You can there's a lot of training inference tricks that make them work in practice. There are many algorithmic innovations like optim optimizers, architectures, RL algorithms and you have to think very hard about how to evaluate your language models. Um and then there's also multimodality which I didn't touch on vision audio videos and so on. Okay. Finally, I'll spend the last two minutes to talk about where we are today. So if you learn about all this language model stuff, you know what they are roughly on a conceptually on a high level. You know why it's a good idea to uh model language just because um you can learn many tasks once and it scales really well and you know you now heard of some key ideas that make them work right like scaling post- training pre-training data uh and RL and all that. Um and so like that leads us to today. So this is a slide from um from from earlier uh this lecture and in this case I didn't tell you what how a language model would fit into robotics but a robot would now for example take um images and language and robot actions all as just individual tokens or words that you can predict in a language model and it's just your language model now fits on many things not just natural language. Um, and that seems to work. And right now today, if you go outside and talk about language models, people kind of assume that you're talking about large language models because these are the things that are, you know, like taking the news and like people are these are things that are having a real impact on society. We don't really talk about the small language models that you, you know, train in this class or the neurotrans that, you know, you use 5 years ago. Um and because of the industrialization uh these language models are now primarily developed by organizations uh each with their incentives like it's very rare to see that like a university like Stanford to just come up with a language model because we just don't have that much uh compute and uh money to do it no so many develop LLM and sell it as software so you've used and claw and many integrate these language models into products you know If you use cursor, there's something called lovable which allows you to build websites. Um

Segment 16 (75:00 - 79:00)

and many release them freely like Quinn, Llama, Kimi, Deepseek for strategic reasons. We don't really know why for many. I think there are we might know why, but like you know people can try to think why do you release such a you know millions of dollars you train them and you kind of just put them in the world. um like uh there must be reasons and others released the models just to advance um open uh research. There's the OMO models and then Percy is leading a project called Marin which is like building this open uh open source models. Okay. So when you think about these closed uh models that like open AI or anthrop develops they are sometimes we also call them frontier models because they are kind of the best they are. Uh they are put behind a paid API. You can access them but they are identity traced. So you know everyone can use them but like you have to like show yourself like who you are to use them. And supposedly they have some secret sauce, right? They have special algorithms. They have special data systems or inference tricks that makes them better, faster, and stronger. Uh we just don't know. Like this is uh what makes it interesting, which is that they produce these objects. They have papers that you know from 2019, 2020, 2022 that roughly tells you how it works, but the precise details we don't know. Okay. But somehow they just you know beat humans at coding uh which is II competitions at math. uh they have this benchmark for frontier math which you know I don't even understand the questions let alone solving them and then they just know a lot of knowledge about the world as well which is the h questions benchmark and these are the things that a couple years back um you know really like have changed the scenes of language modeling and um like how AI as a field is moving forward and they have caused international or governmental response and quote or maybe panic and this is kind of the situation And people love these models nonetheless like despite their clothes, despite we don't know anything about them, um you know, we use them daily. On the other hand, you have these openweight models. These are models that are trained also by companies mostly uh but they just put up the weights. So you can see them, you can like play around with them, you know their architecture, you know how much they cost roughly with math and then uh you can do anything to them. You can fine-tune them um using this library called Enslo. You can run inference on them using this library called VLM. You can run uh run RL on them uh with something called Veraro and you can use it for your own startup. Um most times depends on the license. Um but you can also like try to play with it to remove the safety training. You can try to extract training data from these language models and you can like study the architecture. So these are up there on the on hugging face. They're just on the internet. Um sometimes you wonder like why they're free like these takes so much to train and then you realize okay running these models are actually not free because you have to have the GPUs to run them and they could just sell the the infrastructure to run these models and these open weight models are trying to catch up to the closed frontier models um and they are really catching up. So for example with a recent model called Kim K2 it's supposedly beat um opens and anthropics closed models on certain benchmarks and for these while the weights of the models are open you don't really know anything about the data or the compute or the algorithms that really goes behind the scenes they are just you know hidden and then but you're fine with it because the models are available that means because that means you can like put it on like a workstation on your laptop or something and run them and play down with them. Okay, lastly, uh open source model. You have everything's open. Uh you have um training data, algorithms, uh weights and code, and they're mostly academic so far. So these are relatively small models, toy-ish, uh though they're getting comparative to the older generation frontier models. Um but these are models that can really allow you to do a lot more things because you can now know how they're trained, what data they used, and so on. Okay. Um yeah there are two more slides but yeah I can end it here but yeah it's starting to create a concern about AI safety right these models are getting so good you know they know how hot computers they can create boweapons they can um do coding they can do forecasting and then people to the point where people would write books that says um you know if anyone builds even more powerful models everyone will die you know that kind of a situation um so that's kind of where we are today and there's a lot of problems that you could we didn't even get to talk about uh with these language models. But yeah, that's it. [snorts] That's end of the lecture.

Другие видео автора — Stanford Online

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник