Become a Patreon: https://www.patreon.com/theaiepiphany
👨👩👧👦 Join our Discord community: https://discord.gg/peBrCpheKE
Thomas joined us for the second time to talk about their latest work: LLaMA 3! We cover synthetic data for pre/post-training, why they didn't go with MoE, privacy (was it trained on Facebook user data?), and much more.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
LLaMA 3: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 - 00:27 Intro
00:27 - 02:08 Hyperstack GPUs platform! (sponsored)
02:08 - 06:40 What is new in new Llama?
06:40 - 13:30 Synthetic data
13:30 - 15:35 Privacy - training on Facebook user data?
15:35 - 19:10 Scaling and distillation
19:10 - 25:35 MoE, new architectures?
25:35 - 37:15 Upper boundary for the quality of synthetic data?
37:15 - 45:10 Context length
45:10 - 46:40 What framework does Meta use for Llama
46:40 - 51:20 Playing with smaller Llamas
51:20 - 53:20 Multilingual capabilities
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 SPONSOR
The AI Epiphany - https://www.patreon.com/theaiepiphany
One-time donation - https://www.paypal.com/paypalme/theaiepiphany
Huge thank you to these AI Epiphany patrons:
Eli Mahler
Petar Veličković
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💼 LinkedIn - https://www.linkedin.com/in/aleksagordic/
🐦 Twitter - https://twitter.com/gordic_aleksa
👨👩👧👦 Discord - https://discord.gg/peBrCpheKE
📺 YouTube - https://www.youtube.com/c/TheAIEpiphany/
📚 Medium - https://gordicaleksa.medium.com/
💻 GitHub - https://github.com/gordicaleksa
📢 AI Newsletter - https://aiepiphany.substack.com/
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#llama3 #llms #meta #opensource
Transcript (12 segments)
00:00 Intro
Again, Thomas, thanks for joining for the second time on the Epiphany Discord server. Glad to have you here talking about Llama 3 today. Just briefly on Thomas: he's been one of the authors on many impactful Meta papers, including Llama 2, BLOOM, Toolformer, Code Llama, and Galactica. Pretty much every significant LLM project you've heard of over the last year and a half, Thomas has been a part of. So again, thanks for joining in.
00:27 Hyperstack GPUs platform! (sponsored)
Hey guys, I want to give a huge shoutout to the Hyperstack folks, who have generously sponsored my compute over the past month or so. I got 16 H100s, which is two eight-GPU nodes, and the performance has been amazing: basically a 2x speedup compared to my A100 nodes. So I want to quickly show you how you can get started yourself. It takes basically only three steps. You go to environments, create a new environment, give it a name, and pick between Canada and Norway. That's the first step. Second step: you go to SSH keys, create a new pair, pick the environment you just created, and paste your public SSH key. That's it. And finally, go to virtual machines, deploy a new machine, give it a name again, and select the environment; let's select Canada here because they have more compute there. Then you select the hardware you want, for example H100s, select the image, select the SSH key you just created, and hit deploy. That's it. Literally a couple of minutes to get started. It was really easy for me to create the two nodes I just mentioned and get started running LLM trainings. The documentation is super cool; I could actually solve all of my problems just by looking at their docs. They additionally have a Slack channel where they were super helpful, so I can't recommend them enough, honestly. The thing they focus on: a lot of GPU providers target big enterprises, so oftentimes you can't get on-demand H100s, whereas NexGen focuses particularly on that. You can get top-of-the-line hardware on demand even if you're an individual or a smaller team. They also serve bigger companies, but that's their edge here. So without further ado, guys, let's go back to the talk. I do suggest you check them out, and let's continue.
02:08 What is new in new Llama?
Last time, I think we met for Llama 2; it was one year ago. I can maybe quickly go over where we were then versus now, in like five minutes, and then...

Yeah, I think that's a perfect start. Let's see the difference between Llama 2 and Llama 3.

Well, I think the first thing is, I'm frustrated by how much we can still do to improve those models, including Llama 3. But when I look back, and it's only a year ago actually, almost to the week, that Llama 2 was out, it's cool to see what we have done in a year, what Meta has done in the field, and more generally the open source community.

The way I see it, we had ChatGPT in December, a year and a half ago. Meta had kicked off Galactica two weeks before, and Llama 1 came out the February after, a few months later. Those were only pre-trained models, which means self-supervised next-token prediction. And that's where it started. On my side, for Galactica, I was doing some of the work on instruction following, kicking off the first-ever annotation effort at Meta for RLHF instruction-following models, with Galactica as a LaTeX-instructed model that would help you write papers. Well, it never went out because of what happened, but that work actually got converted into Llama 2, and it is basically the foundation of all post-training RLHF at Meta.

Basically, in four months we created the supervised learning pipeline and then very quickly realized that RLHF is so much more powerful, because the model generates even better data than humans do for a lot of tasks. We applied RLHF at scale, which was a gray zone. We knew people were kind of doing it at OpenAI and Anthropic with very good models, but to what extent? Is this just a narrative, or are they doing it in practice? What is the actual algorithm at scale? What is the proportion versus supervised learning? No one knew, so we had to learn. We created our own type of algorithm, inspired by the previous works, and it resulted in Llama 2. I'm very proud of Llama 2 because it was the first instruction-following-at-scale work that worked really well in open science for something that had been proprietary, meaning we recreated the whole recipe to get there rather than just distilling another RLHF model like GPT-4. That being said, we were focusing only on helpfulness and safety. It worked there, but we had a lot of blind spots, like reasoning, code, and so many other fields. That also explains why, very quickly after, a lot of people distilling GPT-4, and some other techniques like WizardLM's Evol-Instruct, improved over what we had done, outperforming it on some topics like reasoning, code, and so on.

For Llama 3, in particular on post-training, which is the area I mainly worked on, we scaled the Llama 2 recipe in a lot of different dimensions: much more synthetic data, much better curated, much more annotation in all the different areas. And one indication that what we did is pretty good, I guess, is that since the preview release in February-March, I haven't seen a work that manages to significantly improve on the post-training work we have done. So now Llama 3 is there. It's also extended from the preview in a lot of different dimensions, including long context and much better post-training, because we kept working on it for months. I hope we will soon have the LMSYS results; we can discuss evaluation.
I don't think it's the end of the story or that it tells everything, but it's interesting to look at. We should see improvement even for the 70B. And so here we are, ready to start working on Llama 4.
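To make the post-training recipe Thomas describes concrete: supervised fine-tuning followed by iterative rejection sampling, where the current model's own best outputs, as judged by a reward model, become the next round's training data. A minimal Python sketch with illustrative stand-ins, not Meta's actual pipeline:

```python
# Hypothetical sketch of an iterative rejection-sampling loop: sample K
# candidates per prompt, keep the best according to a reward model,
# fine-tune on the winners, repeat. All functions are stand-ins.
import random

def generate(model, prompt, k=8):
    # Stand-in: sample k candidate answers from the current policy.
    return [f"{model}-answer-{i}-to-{prompt}" for i in range(k)]

def reward_model(prompt, answer):
    # Stand-in: a learned preference model scoring (prompt, answer) pairs.
    return random.random()

def finetune(model, pairs):
    # Stand-in: supervised fine-tuning on the selected (prompt, answer) pairs.
    return f"{model}+sft"

def rejection_sampling_round(model, prompts, k=8):
    selected = []
    for p in prompts:
        candidates = generate(model, p, k)
        best = max(candidates, key=lambda a: reward_model(p, a))
        selected.append((p, best))
    return finetune(model, selected)

model = "llama-base"
for _ in range(3):  # several iterative rounds, as described in the interview
    model = rejection_sampling_round(model, ["prompt-1", "prompt-2"])
```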
06:40 Synthetic data
Amazing, amazing work. You mentioned synthetic data; we can take this wherever you want, but can you share a bit more on the synthetic data pipeline? How much of your pre-training currently deals with synthetic tokens, or is it mostly post-training? I think that's what I saw in the paper, but maybe you can go a bit more into it.

Right. So for pre-training, what I can say is very simple: the web is full of... So the way we applied, I don't know if you can call it synthetic data, it's more like synthetic filtering. I mean, classic machine learning filtering, but leveraging Llama 2 as a classifier of topics and text quality to select what the right tokens, the good tokens, are. That's one. The second domain I think you can consider synthetic data is, you know, we also published last year, with Lukas in Paris, Nougat, which is basically a state-of-the-art OCR for the scientific domain. It's actually open source as a project. And you can leverage that to convert PDFs of scientific documents into raw text tokens. That's a way to leverage progress in the field in general to get new, better tokens for reasoning, which helped a lot going from Llama 2 to Llama 3. That's it for pre-training. For post-training, as we said for Llama 2 already, very early on we started to do rejection sampling and RLHF and no longer only supervised learning. That actually means most of the data is the model's own output, but reranked, filtered, and improved by a reward model or some other techniques. So those are pure synthetic data. I think you can see a spark of it; let me see if I can find it easily in the paper for post-training.

Maybe before going to post-training: you mentioned two approaches for pre-training. You said you used Llama 2 as a classifier to select good tokens, but that's not generating synthetic tokens; that's more like using an LLM to filter the real human data. I mean, some of that data will be synthetic at this point anyway. And then Nougat, the OCR: how many tokens can one actually expect to extract from just PDFs? Do you have a feeling for it? That's the first sub-question. And the second one: is the quality really good enough to augment, and not lower, the quality of your pre-training corpus?

I can't share numbers on the number of tokens or percentages, etc., but those were new tokens of quality. You can have the hypothesis that scientific text and PDFs are in general of much higher quality than the general stuff you will find on the web. So any new PDF that needs OCR to be turned into text tokens means pure new knowledge and reasoning tokens that were not present before. And so it has much more value than random noise on the web, or the same information repeated ten times in newspapers. So even a few of them, and we tried to scale it to thousands, are already very valuable.

But when you add that on top of 15 trillion tokens, does it move the needle? Say you have one billion of such tokens or whatnot.

Yeah, it's a good question, but the way I like to think about tokens is, how do I say, a language model is also a machine to ingest knowledge. And from that perspective, there are two ways to consider tokens. You can say: okay, I have one trillion super good tokens; I can train on them 15 times.
And maybe it will learn something that emerges with more and more repeated steps. That's partly true, but partly wrong: at some point you need diversity of tokens. And the second thing is you need new knowledge, because at some point what makes these models so good is that you can ask them things you would not expect a single human to know, and they have all this knowledge they can connect together. So it gives not only additional reasoning tokens but also pure new knowledge the model would not have otherwise. And from that perspective, any new paper is new knowledge for the model.

True, makes sense. I'm just trying to understand the dynamics when you actually add that into such a huge corpus. Obviously, such big models, like a 405-billion-parameter checkpoint, can memorize after seeing a sample only once or a couple of times, so it will add something. But I'm curious about the dynamics of merging those smaller high-quality data sources with a huge pre-training corpus and what that looks like. Did you have any ablations on that in the paper? I don't think I've seen any, in case you did.

We had done some ablations, but this is something very tricky for a specific reason: you obviously cannot do ablations at 500-billion-parameter size. So what can you do? Maybe 7-billion ablations? That's already a lot. At 70 billion it starts to be tricky. But then, a 7-billion model will not memorize the same way and at the same rate as a larger model, right? So for these specific things you want to ablate, like memorization and the data mixture, you can do it to some extent, but there are a lot of limitations. So at some point you also need to play with intuition.

Yeah, you do the YOLO runs, right?

At some point, it is a YOLO run, yes. But with a lot of safeguards and good first principles applied.

Of course.
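Before the chat questions, a minimal sketch of the "synthetic filtering" idea from earlier in this section: an earlier-generation model scores documents for topic and quality, and only high-scoring ones are kept as pre-training tokens. The scoring prompt and the scoring function are hypothetical stand-ins, not the paper's actual classifier:

```python
# Minimal sketch of LLM-based quality filtering of a pre-training corpus.
QUALITY_PROMPT = (
    "Rate the following document for educational quality and topical value "
    "on a scale of 0 to 1. Answer with a single number.\n\n{doc}"
)

def llm_score(document: str) -> float:
    # Stand-in: a real pipeline would send QUALITY_PROMPT.format(doc=document)
    # through an inference service (Llama 2 in the paper's description) and
    # parse the number. A toy lexical-diversity proxy keeps this runnable.
    return min(1.0, len(set(document.split())) / 100)

def filter_corpus(documents, threshold=0.7):
    # Keep only documents the classifier judges to be "good tokens".
    return [d for d in documents if llm_score(d) >= threshold]
```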
13:30 Privacy - training on Facebook user data?
Okay, let me take one question from the chat. There is one on privacy; I don't know if you want to take that one, or the second one, which is on distillation. So there is this question: what are your thoughts on datasets and benchmarks evaluated by Llama Guard, and what methods were used for the evaluation? How do you deal with the challenges of privacy?

Is the question in the chat?

Yeah, you can read it yourself.

So, what are your thoughts on datasets and benchmarks evaluated by Llama Guard... this one, right? Perfect. So, I haven't worked on the Llama Guard part, so I won't answer that specifically. But regarding the challenge of privacy, there are two parts I can answer. One is Facebook data, like user data: we didn't train on that. Everything we have done could have been done by an external company; it's not because it's at Facebook. That's the first thing. The second thing is that on the web there is still a lot of data that may suffer from privacy issues, like copyrighted content and things like that. We also used our classifiers to remove such content. I don't know if that answers the question well enough.

Yeah. I remember there was this recent announcement that basically you will not be able to deploy models in the European Union because they forbid you to train on Facebook user data. I don't know if you want to get into that, or whether you can even reply to that question, but what's the thinking there? Will you actually be using some of the Facebook data going forward, in the next iterations of these models, if you can share that now?

I can share, but for a simple reason: I don't know. I don't know the future. But so far, we haven't explored that yet.

Okay.
15:35 Scaling and distillation
Okay, fair enough. And then the second question, from Kevin: it seems that more and more papers focus on small improvements to current models, so it seems that improving post-training matters more than improving the architecture itself. Do you think future papers from companies like Meta or Google will focus more on post-training and distillation to smaller models?

Let me answer in two ways. From a high-level perspective, my thinking here is in orders of magnitude. Scaling pre-training by orders of magnitude, both weights and data tokens, is one way to significantly improve. Then came post-training RLHF, and I think the original paper there showed that a model ten times smaller can achieve better human preference than a bigger model without RLHF; basically, the smaller GPT, a 7B one, was better after RLHF than the larger GPT-3 model. So that's already gaining an order of magnitude with this technique. And now, scaling RLHF as we have done for Llama 3 is a way to significantly improve. So I think we can explore many ways to gain orders of magnitude: synthetic data, tool-use augmentation, maybe agents, maybe new architectures, improvements in algorithms, or just filtering the data better for more quality. All those aspects get an order-of-magnitude improvement one way or another, and not just by scaling the compute, but all of them combined together. That's what the Llama organization is trying to do as a whole.

Now, on post-training specifically: I don't know if there's more focus on post-training; I would not say that. I think post-training is just much newer than pre-training, so I see many more possibilities to improve. It's a way of saying we did a good job, but we could have done 10 or 100 times better. Whereas, because pre-training is more mature, the improvements are less obvious. Still, for pre-training, something that has bothered me for a long time is that you spend the same compute per token, which doesn't make any sense. Maybe we could end up with smarter architectures, maybe sparse architectures, that could choose the compute in real time. There's a lot of work happening right now there, which could unlock new orders of magnitude, basically.

Related to distillation to smaller models: we are doing some distillation, somehow, with post-training annotation, annotating with the big model and then distilling back to the smaller models. That's one way of distilling. Maybe there are other ways we haven't explored yet. Can we leverage the 405B model and distill it directly? Maybe we'll have some work in that direction if we figure it out. But we are definitely exploring all the aspects, from pre-training to post-training and distillation together, and we'll now release more frequently when we have findings.
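A concrete note on the two distillation flavours mentioned: annotating with the big model and fine-tuning the small one is data-level distillation; "leveraging the 405B model and distilling it directly" usually means logit-level distillation. A hedged PyTorch sketch of the latter, the classic Hinton-style KD loss, not anything from the Llama codebase:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            T: float = 2.0) -> torch.Tensor:
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so 'batchmean'
    # averages per token position.
    V = student_logits.size(-1)
    s = F.log_softmax(student_logits.reshape(-1, V) / T, dim=-1)
    t = F.softmax(teacher_logits.reshape(-1, V) / T, dim=-1)
    # KL(teacher || student), scaled by T^2 as in the classic formulation.
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

loss = kd_loss(torch.randn(2, 5, 128), torch.randn(2, 5, 128))
```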
19:10 MoE, new architectures?
Quick question on the architecture side. You mentioned sparse architectures. Is there any special reason you didn't commit to using mixture of experts? Was it just "let's ship the dense one, we have the infrastructure and the know-how to execute on this quickly", or was there more significant thought that went into picking dense versus going MoE?

The first thing I would say is that I think about MoE as, and it's of course a bit more complex than that, a dense model with more than one expert. And from that perspective, to me it's just a hyperparameter to optimize. It's more than that, because you get much more flexibility in the architecture, but still. So yes, of course, for Llama 4 we're exploring optimizing this hyperparameter in the future. And yes, the fact that we had all the infrastructure ready for dense played a role; it doesn't mean we are not exploring MoE, of course. I don't know how it will end; I don't think there's a clear winner yet. I saw that after a lot of noise last year from Mistral, now they're releasing a dense model again. I think MoE is more likely to be the future, but what kind of MoE? That's still very unclear.

Mhm. When you say what kind, do you mean the actual specifics of how you implement the routing, or what do you mean?

How do you implement the routing? Can you skip layers? Can you have millions of experts or just eight experts? There are papers around all those different angles. You know, it reminds me of the time just before the transformer, or maybe at the beginning of transformers as well, when we had so many small improvements one way or another: people were testing different positional embeddings, testing universal transformers with recursive layers, and all those kinds of ideas. Maybe we're at that stage for MoE, or even earlier.

Makes sense. I guess RoBERTa was your initial exploration of ablating various decisions in transformers; I remember reading it a few years ago.

Sorry?

The RoBERTa paper, right? That was one of the seminal works you guys did there.
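Thomas's design-space questions (routing, number of experts) map onto code quite directly. Below is a hedged sketch of one common point in that space, a token-choice top-k router in PyTorch. Real systems add load-balancing losses, capacity limits, and expert parallelism; nothing here reflects Meta's internal implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: each token is routed to its k highest-scoring experts."""
    def __init__(self, d_model=64, n_experts=8, k=2, d_ff=256):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # routing scores per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.router(x)               # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(10, 64))  # 10 tokens through the toy layer
```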
Okay. Thomas, if you have something you want to share, feel free to; otherwise I'll be picking between the questions in the chat and my own. Let me see. There was a question from Jamie that you kind of addressed: how much intelligence do you expect will come from further scaling of pre-training versus smarter scaling? By that he means data quality, fine-tuning approaches, et cetera. And why haven't we seen an order-of-magnitude jump since then? I guess he refers to the parameter size of these models compared to the jumps we had, like GPT-3 jumping 10x and then allegedly GPT-4 jumping 10x to 1.8 trillion or whatnot. Why haven't we seen bigger jumps? Is it just compute, energy, money?

Wait, to clarify: is he saying we did scale the weights but it did not lead to more intelligence, or quite the opposite?

He thinks we haven't been scaling as fast as in the early days, back in 2020, I guess. Well, I think the answer there is just energy and compute cost, and all of these issues: we still don't have a 1-gigawatt data center to train a 10x GPT-4 and all of that. I think it's currently hardware and the chip supply shortage that are the bottleneck, more than the theory.

Yeah, easily. You know, I think we'll keep scaling next year by another order of magnitude still. There are some public announcements about that: Elon Musk said they're training a model on 100K H100s, and OpenAI is probably doing the same. So we will still have that next year. How sustainable this trend will be is an open question, which leads to finding other ways to scale.

When do you think we'll hit the entropy of language? How much do you think we need to scale, theoretically, between now and infinity? Where do we stop? Or do you think there is literally no upper boundary, just understanding the universe?

I have no idea. It's a very good question: is the bound at plus infinity or not? Maybe one answer from first principles is that it's bounded by the data it's trained on, which is human text. So at some point, if you ingest all that knowledge and you can leverage all the intelligence in there, you're probably bounded by that intelligence.

Mhm. Okay, there was one question on handheld devices. I think you guys recently announced MobileLLM, but let's focus this discussion on Llama 3 only. Okay, somebody asked a question about DPO and PPO. I did see you guys dropped PPO in favor of DPO. I assume it's just that training with DPO is so much easier and more stable in practice. Is that the reason?

Yes, exactly. My perspective on that is that the magic of RLHF, I've said it many times, is shifting the distribution thanks to human preferences, and doing that iteratively: you get a new model, you annotate with this new model, you do that again, and so on. As long as you have an RLHF method that does that, it's fine. And PPO is less scalable than DPO. Somehow we managed to make DPO work pretty well, so that's totally fine. I don't expect a huge improvement from the algorithmic perspective here, other than in the principles of annotation.
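Since the conversation settles on DPO: the standard DPO objective (from the Rafailov et al. paper) is compact enough to show in full. A minimal PyTorch sketch, not Meta's code; each input is the summed log-probability of a whole response under the trained policy or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Maximize the margin between the policy/reference log-ratios of the
    # preferred ("chosen") and dispreferred ("rejected") responses.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with made-up sequence log-probs:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
```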
25:35 Upper boundary for the quality of synthetic data?
Mhm. There's this conversation happening on Twitter about the upper boundary, I guess, of the quality of the synthetic tokens we can expect with these approaches. Do you see any exciting research that promises to do what MuZero did compared to AlphaGo: just bootstrap above the human level? At that point you have, I guess, an infinite amount of data, and it's just about compute. We would literally reduce AI development to more compute and more clusters if we could just generate amazing synthetic data at scale. How do you think about that?

Yeah, for the future of synthetic data I'm really bullish, but for what I call augmented language models, in the sense of punching above the weight of the model. You cannot just generate synthetic data with the model and train the model itself on it. There was a recent paper, which was nice, showing that it doesn't work: training continuously, multiple times, on your own synthetic data. Of course that doesn't work. But then I saw a lot of news saying that training on synthetic data is not working at all. Of course it will work, but only if you give the model the ability to augment itself: to execute code that it generated, meaning the model anticipates this code will work and produce the result it expects, and then, by executing it and grounding it in the environment, the code execution, the compiler, it sees the actual execution, or maybe there's a bug, and it fixes itself and learns from this feedback continuously. Same when it doesn't have some information: it knows it doesn't know, it can search for this information online, learn that new information, and focus on what it doesn't know. Only that additional information, from an entropy perspective, will lead to self-progress. Not in a silo.

That's an exciting thought. It's almost like you're setting up the same kind of constrained environment where we know ML algorithms perform amazingly well, the game-type applications DeepMind is famous for. When you constrain it to code, which is obviously a more constrained domain than open-domain language, creating basically that type of constrained environment will, you think, help you bootstrap the models and increase the quality of synthetic data. Makes sense. Are you doing something exciting on that front right now, if you can share?

What I can share is that it's definitely something we will explore. And what I can share is, you know, when I did Toolformer, the conclusion for me at that time, and I think I've said this in the past, was that Toolformer was really cool, learning tool use with a self-supervised loss and things like that. But when you have an instructed model, and we have shown this, actually, it's what I'm showing right now in the paper: a model can do zero-shot tool use, basically. Is it there? Zero-shot tool use. What is it? You define the list of tools the model has access to, what those tools can do, how to call them, and what arguments to pass them. And in a zero-shot fashion, the model can, thanks to those tool definitions given as in-context prompts, use them in practice.
And that ability of zero-shot function calling emerges from instruction following. So, to me, the future of Toolformer was these instruction-following models that can then be provided with any tools in a zero-shot fashion, not in a self-supervised manner. Now you can describe the list of tools in your prompt, and the model will use them when it decides to. And to me, it validates the path from Toolformer to Llama 2 and 3, and now to these new capabilities of zero-shot function calling, leveraging tools to augment the capabilities and create synthetic data.

Nice, I agree that's a very exciting line of work. We did have the Gorilla authors, and I think even the Toolformer authors, on a few months ago.
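A minimal sketch of the zero-shot function-calling pattern Thomas describes: tools are declared only in the prompt, the model replies with a structured call, and the runtime executes it and feeds the observation back. The tool names and the JSON call format here are illustrative assumptions, not Llama 3's actual calling syntax:

```python
import json

# Toy tool registry; in a real system these would be actual services.
TOOLS = {
    "brave_search": lambda query: f"<search results for {query!r}>",
    "calculator": lambda expression: str(eval(expression)),  # demo only, unsafe
}

# The tool list is declared purely in-context; no fine-tuning on these tools.
SYSTEM_PROMPT = (
    'You can call these tools by replying with JSON {"tool": <name>, "args": {...}}:\n'
    + "\n".join(f"- {name}" for name in TOOLS)
)

def run_turn(model_reply: str) -> str:
    # If the model chose to emit a tool call, execute it and return the
    # observation to be appended to the conversation; otherwise pass through.
    try:
        call = json.loads(model_reply)
        return TOOLS[call["tool"]](**call["args"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return model_reply  # plain text answer, no tool call

print(run_turn('{"tool": "calculator", "args": {"expression": "2*3.14"}}'))
```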
Some more questions in the chat. Let me see. Did you see any interesting one we can take, or should I just go chronologically? Aha, this is an interesting one. There was a recent paper claiming that replacing standard pre-training data with instruction pre-training, where the pre-training data is rephrased similarly to fine-tuning data, led to performance gains. Do you have any thoughts on this, and is it a direction you'll consider for future Llamas?

Can you repeat exactly?

So, this is from C Juat, I'm probably butchering his name, but the question is: if you replace some of the pre-training data with data that's closer to the post-training, I guess instruction-response, type of format, did you experiment with that inside your pre-training corpus? And if so, did you observe any gains?

No, we haven't. We will do it, but we haven't yet.

Well, what's your thinking? Why would that help, other than the model being exposed to that type of format early on?

Yeah, I don't have a good intuition yet. Let me answer with two things. Synthetic augmented data, which we just discussed, done at scale, will arguably be something close to pre-training. Now, putting the data we collected for Llama 3 back into pre-training is an open research question, in the sense that the risk is that there are some discrepancies, and you expect your next model to be better than that data, so you're distilling somehow. Imagine if we had put Llama 2 data into the pre-training of Llama 3. Would that be helpful? I don't know, and the risk I see is that Llama 3 should be better than that data, so you're putting in data that's worse than the final model. That makes me think it's risky. On the other hand, we know those outputs are now better than human ones, so they're probably better than the average token we trained on. For that reason, I think it's good. So there are pros and cons, and I don't have a good intuition about which effect wins.

Mhm. One question from my side. We were talking about synthetic data a lot; from the engineering standpoint, how do you scale up synthetic data generation to the pre-training stage and generate trillions of tokens? What types of challenges do you think will arise there? Have you done anything on that front?

I think the main challenge is having a reliable model that can use a set of tools that makes it a general assistant, in the sense that it is significantly better across a set of prompts on math, coding, reasoning, browsing the web, and all those things, so that it reliably gets close to 100% on MMLU and GSM8K, etc., with the tool set we give it by default and a multi-stage setup. You can see the spark of it here in this prompt, where you have a very complex prompt that requires browsing the web and doing some calculation, and the model does a three-step call: first it browses the web with Brave Search to check the US inflation rate and gets the output, then it checks something else, then it does all the calculations and so on. When this becomes reliable, I'm very bullish on scaling it toward synthetic data. But there are two worlds between having something that performs well 50% of the time and 95% of the time.

Yep, that makes sense. I guess I was more curious purely from the engineering standpoint. Do you suspect the compute needed for generating such an abundance of data will be similar to what we currently spend on the pre-training stage?

Yeah, that's interesting. I think we're massively underestimating the inference needs of the future, and we will transition more and more to inference clusters, which will benefit pre-training, post-training, and all those things, over pure training clusters.

I think the industry agrees by and large. That's why people were so bullish when Groq came out, and recently the Etched folks with the Sohu chip. There are definitely companies in the space focusing only on inference because they know they'll capture a huge market value there. Okay, let me see a couple more questions. Han Chung Lee: you mentioned synthetic data to improve reasoning. Is that specific to code and math, or are there other reasoning tasks as well? And what is your definition of the cutoff between pre-training and post-training? I'm going to reply to this one in the chat myself; it's fairly standard stuff. Let me see if there's a more interesting question for you.

Yeah, while you look for other questions, what I can say at least is: of course, for reasoning it makes sense to use a calculator and code execution, as it makes sense for code to use code execution output. But we used, as you saw, Brave Search to also improve factuality and knowledge. So it goes beyond that.

Mhm. There is one multimodal question, from Kevin: why train an image encoder from scratch on image-text pairs instead of using DINOv2, which has been shown to work well, for example in OpenVLA?

I didn't work on that part specifically, so I cannot answer; I just don't know.

Makes sense. I haven't read that part of the paper, but I would suspect there's much more data available now, and you want to pre-train from scratch so you can merge it with these much more powerful reasoning engines, meaning the 405-billion-parameter LLM checkpoint. So you probably just have to retrain things from scratch there. The paper also mentions using a lower batch size early in the initial pre-training in order to improve training stability. What is the intuition behind that? I thought smaller batches produce a noisier gradient signal, and so on. I think the answer is just curriculum learning; Thomas, correct me if I'm wrong, but this is kind of slowly giving the model an easier task and then gradually expanding the context length and the batch size as pre-training progresses.

Yeah. No, I don't have more intuition there. For progressive context length I can give you some intuition, but for progressive batch size, no.
Maybe on context length then, what's your take?
37:15 Context length
I obviously read that part of the paper: I know that over the last 800 billion tokens you gradually expanded the context length to 128K. But going beyond that, to millions of tokens of context, what are some of the methods you're bullish on?

I think that's a very interesting topic, because, well, what are the three ways, basically, to get some context? One is to inject it into the weights by fine-tuning the model. Two is through the attention. And three is with tool use, like RAG in context or things like that: Ctrl+F over your previous messages or anything similar, right? Or agentic approaches, where you can click, navigate, browse, scroll down, and read progressively. So you have these three ways, and what the optimal boundary between them is, is not clear at all at the moment. Of course you want to improve on all three work streams, the different aspects, but it's not clear to me whether it's very efficient to scale to 1-million-token length rather than use more optimal methods like the agentic ones. However, if it works, it's very useful. And it's even possible that scaling one aspect will unlock some other dimensions later.

When you say agents, how does that answer the question of extending the context length? If I understood you correctly, I cannot quite follow you on that front.

Well, there was a test of Claude on long context, the needle in a haystack, where a sentence is placed in the middle of a book and the model has to find it. Now, to answer that, you can put the entire context through the attention and have the model identify the sentence. Or you can say: okay, I have 8,000 tokens of context length, I cannot ingest everything, but I can open the book as a PDF, scroll down, navigate page by page, read everything, keep candidate sentences on a scratchpad, and make my decision at the end.
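A toy sketch of the agentic alternative Thomas outlines: page through a document in chunks that fit a bounded context, keep candidates on a scratchpad, decide at the end. The `ask_model` helper is a hypothetical stand-in for one bounded-context model call:

```python
# Agentic "paging" over a long document instead of one huge attention window.
def ask_model(context_chunk: str, question: str):
    # Stand-in: one bounded-context model call that either returns a
    # candidate answer found in this chunk or None.
    return context_chunk if question.lower() in context_chunk.lower() else None

def needle_search(book: str, question: str, page_chars: int = 8000):
    scratchpad = []  # notes accumulated across pages, like Thomas describes
    for start in range(0, len(book), page_chars):
        page = book[start:start + page_chars]
        candidate = ask_model(page, question)
        if candidate:
            scratchpad.append((start, candidate))
    # Final decision is made over the small scratchpad, not the whole book.
    return scratchpad[0] if scratchpad else None
```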
Makes sense; now I understand what you mean. I was more referring to the fundamental limitation, because I think it's a fair analogy to compare context length with RAM in the classical computing era, right? Nowadays you don't really think about RAM. And early in the days, I think people took similar approaches to solve some of their issues, for example using a small buffer, reading from the hard drive in chunks, and then processing, summarizing, whatnot. But that's more of a workaround, because you have a compute bottleneck, you don't have a powerful system. Ultimately you probably want to work on both: you want to increase the RAM, the context length, and you want to do those other things. So my question was more about this fundamental limitation and how we unlock it to go to millions of tokens.

Yeah, but I would tell you that, sure, basically you can scale the computer hardware toward infinity, and yes, I agree with you. But there's a reason why humans have this kind of structure for memory and for the context of our own attention. When we read a book, we don't read all the tokens together, and it's probably because we have mechanisms that are much more efficient than what we are doing now. And so, back to the earlier discussion about the limits of scaling: can we scale, or is the peak behind us, given that we cannot scale 10 or 100 times right now, since we don't even have the energy to build a cluster for that? My point is that maybe there's a smarter way to do it that would enable scaling orders of magnitude beyond this local minimum.

I don't disagree with that. But I also don't think we should anchor ourselves too much in the evolutionary artifacts that are humans. In the curriculum learning of the human race, books appeared very late, in the post-training phase, I guess, and the Gutenberg press was the 15th century or whatnot, so we couldn't really learn that from first principles. There might be better ways, and I think both approaches matter: increase the raw context length, the RAM, and do the type of research you just mentioned with agents, summarizing, and all of that.

Yeah, I absolutely agree, and I think that's why it's such an interesting question: what is the optimal split between scaling attention and these more human-like agentic behaviors? What you said makes a lot of sense; we humans may not be the optimal way to do it. That's why, from a research perspective, it's very fascinating. I genuinely don't know what the answer is, what the best path forward is. I don't think anybody knows, but this part is fun.

Okay, let me see if there's something else. By the way, if anybody has a question, you can also just raise your hand and ask it; that might be easier. Okay, Evan, go ahead.

Okay: how much needs to happen for an architectural change to make it into the main Llama training run? Grouped-query attention made it in very fast; the paper was published just a couple of months before the Llama 2 paper dropped. But mixture of experts still hasn't made it.

Yeah. So one thing to consider, I guess, for Llama 3 in particular, is that it was a time when we were also starting to merge all the capabilities together in post-training, all the modalities together, and we'll push forward on that more intensely: multimodal models end-to-end and things like that. That already requires a lot of work, and we don't have infinite bandwidth. So that's also part of the rationale for not changing everything. What's the rationale exactly? There's no single good answer: there's some stuff in papers that we just didn't have the time to test. Or MoE, which requires adaptation across a lot of layers; it's not just about pre-training but maybe also post-training, because we didn't want to end up with two different families of models. And how does post-training behave with a mixture of experts? Maybe it will be better, but maybe worse. And what about adapters for multimodality? For all those questions, some changes just take more time, which doesn't mean they will never happen.

Mhm, thanks for answering. Thomas, one question on the initialization side of things during pre-training. I recently implemented μP for the llm.c framework, I don't know if you're familiar with it, and we're kind of struggling to get the performance benefits of μP. Did you experiment with μP for the latest Llama models, and what are some of the conclusions you reached there?

I don't know. If someone did, I'm not aware of it.
Okay, then I'll just ping you after the call to see if we can get some information there. And then a question from Michael about the training framework.
45:10 What framework does Meta use for Llama
What framework do you guys use to actually train the Llama models? Is it open source? Is it an internal fork?

No, it's internal, and it's not even a fork; it's built from scratch. It's a code base built by the original Llama 1 team that is now probably 100 times bigger than at the time, including our post-training and all the other stuff. But it has been built 100% internally. One thing to consider here is that we have specific infrastructure, our cluster of GPUs and things like that. So think of the training code as a factory for a Formula 1 car: it will never be used by a lot of folks, but it's very tailored to our own needs. And then, once the model is ready, we are happy to share it in a way that benefits most people, in the fine-tuning phase, and all the partners.

Okay, so that's partially the answer to why you're going for open weights as opposed to code: the code is basically too tailored to your particular research cluster, the 24,000 GPUs or whatnot you've got. That actually makes sense. It's kind of sad that we're getting into a space where well-resourced companies build such specific setups that you can't really collaborate easily on the infra side.
46:40 Playing with smaller Llamas
Okay, a question from Cam: for the smaller models, for example the 7B Llama 3.1 (I think it's 8B, but anyhow), what's your opinion on the limitations in terms of quality? Did you get time to play with them?

So, it's interesting. I did play a lot with Llama during Llama 2 and during Llama 3. And I think we are at a state, which also resonates with what I said at the beginning about the limits of the LMSYS arena, where, playing with these models at the level they are now, it's barely possible to distinguish which model is better than another, simply because we are bad at asking general prompts; we ask things that are too simple in general. For Llama 2, it was super easy for me to say: oh, there's an improvement here, an improvement there. I had my set of prompts, I modified it over time, but that's it. Now I throw those prompts at the models; all the models are slightly different, but they all handle them fine. We need to go to the next level, very expert-level capabilities or very specific things, to measure the blind spots in terms of capabilities. I still manage to, but it takes me like ten times longer; it's much more time-consuming than before.

Makes sense. And I guess that's one of the main criticisms of the arena types of benchmarks: our questions basically all collapse to the same simple things. That's the TL;DR there. A question from Evan: what surprised you the most about the process of creating Llama 3?

Sorry, say again?

What surprised you the most during the process of creating Llama 3, if anything?

That it went so smoothly. I mean, at the end we have a model that is close to GPT-4o and significantly better than the original GPT-4. And nowadays it seems like a given, something everyone considers doable. But I can tell you that six months ago a lot of folks were actually wondering whether it was even possible for the open source community, and Meta, to close the gap so quickly. And I didn't know.

Nice. I would agree with that, and probably even more so for Mistral: for such a small company to also get there so quickly. A question from my side: I mentioned llm.c before. Did you actually play with it internally? Do you think those types of hyper-optimized frameworks can help you speed up some experimentation internally or whatnot?

Say again?

So, llm.c, did you try it internally? Do you know of anyone who has played with it already inside Meta?

Playing with...? Sorry.

llm.c, the framework I mentioned, written in C and CUDA, started by Andrej Karpathy a few months ago.

No, I don't know. There are probably some folks who tried it, but we were mainly focusing on Llama 3 anyway, and we use our own infra and compute.

Makes sense. By the way, is my sound okay?

Yeah, yeah, I understood llm.c, so...

Okay, yeah, llm.c. Let me see. One more question, from Demetrios: I imagine creating synthetic SFT data for math involves sampling a question multiple times and only selecting the attempts where it got the right numerical answer. For non-math, non-coding data in Llama 3, how did you assess the quality of synthetic data? I guess, yeah, how do you do evaluations internally?

We have a bunch of benchmarks: academic benchmarks, accuracy metrics per capability, and human evaluation as well. So, you know, we have methods that we validate at small scale on some ablations, where they seem to perform well.
Then we scale them up for the next iteration of the 405B model, and we expect some improvement. Is it because of these specific synthetic data? At each iteration we push a lot of different things, so it's hard to know the confounding factors, but we always manage to do it in a way that improves over the previous iteration.

Maybe a last question from my side, because I see we're close to 6:00 p.m. and you have a hard stop.
51:20 Multilingual capabilities
Did you do anything related to multilingual capabilities, or can you share something on that topic?

Yeah, we actually made a huge effort to make it better at multilingual, you said, right?

Yeah.

So, in pre-training we changed the data mixture toward more multilingual data, and in post-training we also put a lot of effort into languages other than English, with human evaluation as well. So yes, definitely a big push there. I think the results should be significantly better for the languages we targeted.

Nice. What are some of the trade-offs you have to make when you want to inject a multilingual corpus into your pre-existing, I guess mostly English-based, corpus? Did you observe any issues with multilinguality at first, or at such scale do you not care, and the model just learns everything?

The only balance I can mention that you have to take into account is not in terms of data and knowledge capacity, but the tokenizer, because you need a tokenizer that covers many more tokens. Basically, it's a bigger tokenizer, which affects the total number of steps you will do in training. The bigger the tokenizer, the more compute it takes to backpropagate the loss, but also the more text you can fit into the context for the same context length. So there are pros and cons related to multilinguality.

How did you make the final decision on how big the tokenizer should be? I think it was 128K, right?

Yeah, with some experiments, some compute estimates, all those numbers, basically trying to find a Pareto frontier, what we found to be a good balance.
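A back-of-envelope on that tokenizer trade-off, using assumed but commonly cited numbers (a ~128K vocabulary for Llama 3 and d_model = 16384 for the 405B model):

```python
# Cost side: a bigger vocabulary means a bigger embedding table (and a
# bigger output projection, paid again at the softmax).
vocab_size, d_model = 128_256, 16_384
embedding_params = vocab_size * d_model
print(f"embedding table ~{embedding_params / 1e9:.1f}B params")  # ~2.1B

# Benefit side: a larger multilingual vocab compresses text into fewer
# tokens, so a fixed context window (and a fixed training token budget)
# covers more raw text. E.g., if compression improves from ~3.2 to ~4.0
# characters per token (illustrative numbers):
chars_in_window = lambda ctx_len, chars_per_token: ctx_len * chars_per_token
print(chars_in_window(128_000, 3.2), "->", chars_in_window(128_000, 4.0))
```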
Makes sense. Awesome. Thomas, thanks a ton for joining. It was very interesting for me to just ask all the questions I had. Thanks for joining.

Thank you, Alex. Thank you, everyone, and see you. Bye, guys. Cheers. Bye-bye.