# NVIDIA's New Open Source Model Surpasses GPT-4o and 3.5 Sonnet

## Metadata

- **Channel:** TheAIGRID
- **YouTube:** https://www.youtube.com/watch?v=dK-KizVRn4c
- **Date:** 16.10.2024
- **Duration:** 12:01
- **Views:** 58,367
- **Source:** https://ekstraktznaniy.ru/video/13985

## Description

Prepare for AGI with me - https://www.skool.com/postagiprepardness 
🐤 Follow Me on Twitter https://twitter.com/TheAiGrid
🌐 Checkout My website - https://theaigrid.com/


Links From Today's Video:
https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct
https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

00:00:00 Model introduction
00:00:37 Benchmark performance
00:01:35 Surpassing GPT-4
00:02:27 Reward modeling
00:03:37 Dataset innovation
00:04:34 Performance results
00:05:37 Style control
00:06:33 Practical testing
00:07:38 Reasoning challenge
00:09:18 Prompt engineering
00:10:37 Counting ability
00:11:52 Future implications

Welcome to my channel where I bring you the latest breakthroughs in AI. From deep learning to robotics, I cover it all. My videos offer valuable insights and perspectives that will expand your knowledge and understanding of this rapidly evolving field. Be sure to subscribe and stay updated on my latest videos.

Was there anything I missed?


## Transcript

### Model introduction [0:00]

So NVIDIA have just released their stunning Llama 3.1 Nemotron 70B Instruct model, and crazily enough, this model surprisingly beats every closed-source model. It seems like once again open source has raced forward, despite closed-source efforts to best them in terms of which model is currently state-of-the-art. There's going to be a lot to dive into in this video, because they did introduce a new technique with how they produced this model, and I think it's rather interesting how they managed to do

### Benchmark performance [0:37]

this, and it's worth noting. You can see right here it says the Llama 3.1 Nemotron 70B Instruct model is a leading model on the Arena Hard benchmark from LMArena AI, so let's get into exactly what was done. Essentially, if you aren't aware, NVIDIA used the Llama 3.1 model as their base model and then did some post-training on it, I believe with reinforcement learning, and that kind of reinforcement learning managed to get this model to surpass state-of-the-art closed-source models. This video also has some of my own tests that were quite surprising, so I think AI is about to get even more insane. For those of you who want to see the actual benchmarks of the model, you can see here that Llama 3.1 Nemotron 70B scores 85.0 on Arena Hard, 57.6 on AlpacaEval 2 LC, and 8.98 on MT-Bench,

### Surpassing GPT-4 [1:35]

and you can see that this actively surpasses all prior models. This one is really surprising, not just because it surpasses Claude 3.5 Sonnet, but also because it surpasses GPT-4o, which was recently debuted by OpenAI as the frontier model that can do a lot more than just text. So I think this is pretty insane. Not only that, now that I'm looking back at this, I can also see that it manages to surpass the Llama 3.1 405B Instruct model, which is a monumentally larger model. Somehow, training the model in a certain way has allowed them to perform this kind of feat where it surpasses the closed-source models, so the way you fine-tune the model is going to be really impactful on the results.

### Reward modeling [2:27]

Now, essentially there was this paper that they put in the Hugging Face description, and basically what they did that was really different to everything else was introduce an advanced reward model used to improve the alignment of AI models with human feedback, and we'll get to that later, because humans really do love the responses of this model. Essentially, the researchers addressed two main approaches to reward modeling: the Bradley-Terry style and the regression style. Both these methods are used to guide AI models to provide more useful and accurate responses by assigning them reward scores based on their performance in following instructions. The Bradley-Terry model focuses on comparing two responses to a prompt and identifying which one is better, while the regression model predicts a numeric score for a response based on several criteria like helpfulness or correctness. Now, what's crazy about this is that they faced a challenge: you've got these two different ways to guide the models, but these models are often trained on different types of data, which makes it really hard to compare them directly.
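The two reward-modeling styles described here can be sketched as toy loss functions. This is a minimal illustration, not NVIDIA's implementation: the function names, the scalar rewards, and the example values are mine.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style: model P(chosen beats rejected) as
    sigmoid(r_chosen - r_rejected), then minimize the negative
    log-likelihood of the human preference label."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def regression_loss(r_pred: float, human_score: float) -> float:
    """Regression style: the reward model predicts a numeric rating
    (e.g. helpfulness on a Likert scale) and is trained with squared error."""
    return (r_pred - human_score) ** 2

# A reward model that ranks the human-preferred response higher incurs
# a much lower Bradley-Terry loss than one that ranks it lower:
loss_good = bradley_terry_loss(2.0, -1.0)   # preferred response scored higher
loss_bad  = bradley_terry_loss(-1.0, 2.0)   # preferred response scored lower
```

Note the difficulty the transcript points at: the first loss needs paired preference labels, the second needs absolute numeric ratings, so the two kinds of reward model are normally trained on different data.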

### Dataset innovation [3:37]

So this is where NVIDIA's genius comes in. To overcome this, the authors of the paper presented a dataset called HelpSteer2, which includes both types of data: preference rankings for Bradley-Terry and Likert-scale ratings for regression. This new dataset helps bridge the gap between both approaches, allowing for a more comprehensive comparison. So overall, what they managed to do here, and why this model surpasses state-of-the-art models, is that they used reward models, which help produce better responses by scoring the AI's output and guiding the model's responses, and they trained those reward models more effectively on this new dataset, HelpSteer2, with its combination of preference rankings and numeric ratings. This new combined reward model achieved top scores on a benchmark called RewardBench, and basically, by combining these methods, they managed to outperform state-of-the-art systems.
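To make the "both types of data" idea concrete, here is a hypothetical record shape carrying both annotation styles side by side. The field names and the attribute set are my own illustration, not the actual HelpSteer2 schema.

```python
# One hypothetical training example usable by both reward-model styles.
example = {
    "prompt": "Explain in one sentence what a reward model does.",
    "response_a": "A reward model scores candidate outputs so training "
                  "can steer the LLM toward better responses.",
    "response_b": "It is a neural network.",
    # Likert-scale attribute ratings per response (regression-style targets)
    "ratings_a": {"helpfulness": 4, "correctness": 4, "verbosity": 1},
    "ratings_b": {"helpfulness": 1, "correctness": 2, "verbosity": 0},
    # Pairwise preference label (Bradley-Terry-style target)
    "preferred": "response_a",
}

# A regression head would train on the per-attribute ratings,
# while a Bradley-Terry head would train on the "preferred" label,
# from the very same example.
regression_targets = example["ratings_a"]
preference_target = example["preferred"]
```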

### Performance results [4:34]

Now we can also see NVIDIA's model's performance on Arena Hard Auto. You can see right here that Arena Hard Auto is an automatic evaluation tool for instruction-tuned LLMs: it contains 500 challenging queries from Chatbot Arena, and it prompts GPT-4 Turbo as a judge to compare each model's responses against a baseline model. Arena Hard Auto actually has the highest correlation and separability to Chatbot Arena among popular open-ended LLM benchmarks. When we take a look at these results, I truly find them pretty fascinating, because we can see the leaderboard both with and without style control. Essentially, style control accounts for the fact that the format of a response can alter how humans view its helpfulness, even when the underlying information is the same across different AI systems. For example, in certain

### Style control [5:37]

responses you would rather have bullet points than just a sentence. But we can see here that overall, Llama 3.1 Nemotron 70B Instruct scores just a couple of points above GPT-4 Turbo and other models, and surprisingly we can see a large number of models on that list, although we don't see Gemini's recent models, so I would like to see if we do get those there. It's scoring just behind o1-mini and o1-preview, which is rather fascinating. If we take away the style control, we can see that these discrepancies are a little more pronounced, but I would say that a 70B model doing this well is still a remarkable feat, because it means that with how you guide these models after they've been fully trained, we can still achieve marginal gains that could potentially catch up to state of the art.
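As a rough sketch of how a pairwise-judged leaderboard score like this can be aggregated: each of the 500 queries produces a judged win, loss, or tie against the baseline, and those are folded into a win rate. The function name and the counts below are made up for illustration; the real Arena Hard Auto pipeline uses a GPT-4 Turbo judge and a more elaborate scoring scheme.

```python
def win_rate_vs_baseline(wins: int, losses: int, ties: int) -> float:
    """Aggregate pairwise judge verdicts into a win rate against the
    baseline model, counting each tie as half a win."""
    total = wins + losses + ties
    return (wins + 0.5 * ties) / total

# Hypothetical verdicts over the 500 challenge queries:
score = win_rate_vs_baseline(410, 60, 30)  # → 0.85
```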

### Practical testing [6:33]

So things like these are really surprising. Now, some of you might be wondering: okay, NVIDIA have done it, they've managed to get this model to perform better than state-of-the-art, but benchmarks aside, how does it fare on certain questions that you might ask it? Of course, you could always test the model yourself, and it's completely up to you what kind of questions you ask it, but what I wanted to do is ask it some questions from a research paper that I covered in a video three days ago. Those of you who watch my videos will know there is this thing called GSM-NoOp; it's from a research paper where Apple argued that LLMs don't reason. The first question you're looking at here is a reasoning-based question that contains some completely irrelevant information. You can see it asks about how Liam wants to buy some school supplies, and the irrelevant information is highlighted in pink: it says, assuming that due to inflation the prices were 10% cheaper last year, how much

### Reasoning challenge [7:38]

should he pay now? But the question starts with the information you actually need: it gives the current prices of the products he's checking out with, and then adds some random information about inflation to see if the models get confused. Interestingly enough, we can see that o1-preview, OpenAI's best model, unfortunately gets confused and does the wrong calculations. I think this is what happens when you have a model that is rewarded for reasoning and its reasoning steps: sometimes you don't need reasoning steps, you just need the model to really look at the question and figure out what it actually demands. So I did put this question into the 70B model, because I wanted to see if it gets it right initially, and if I'm being honest, it didn't get this question right. That doesn't mean this model is awful at all; remember, this research paper was a complete shock, and most models will get this question wrong. But what I did do was implement something from a different research paper, and I think you should do this as well: if you have a hard, difficult reasoning question that you really want an AI to perform well on, perform this small step and you're more than likely going to get much better outputs from the model. There's a research paper, or maybe a blog post, I don't remember where I saw it, but it says that asking an LLM to reread the question improved reasoning by something like 10 to 15%. So all I did was ask this smaller model to reread the question.

### Prompt engineering [9:18]

I didn't even give it anything else: all I said was just "reread the question," no additional context like "what about this and that." And you can see that after asking the model to just reread the question, it managed to actually realize that the information at the end of the prompt is irrelevant. You can see right here it says that the question asks for the amount Liam should pay now and provides current prices, and that the information about the inflation rate does not affect the calculation, as we already have the current prices. This is what I'm saying: sometimes the smartness is inherently built into the model, and you just need to prompt it out, which takes some clever prompt engineering. I think this model is really smart as well, because I also gave it a question that OpenAI's o1 failed. It wasn't this question, but this other one, which once again contained some irrelevant information: it was simply asking about picking kiwis, where of course the number of kiwis doesn't change if some of them are smaller than average. That was the key piece of information that o1-mini and other models managed to miss, and when I gave this question to this new model, surprisingly it manages to understand that Oliver has a total of 190 kiwis, and that the size variation of the kiwis picked on Sunday is noted but does not impact the overall count.
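The "reread the question" trick can be sketched as a tiny prompt wrapper. The helper name and the exact wording below are my own; the point is simply to append the rereading instruction before sending the question to whatever model you use.

```python
def build_reread_prompt(question: str) -> str:
    """Wrap a tricky question with a 'reread the question' instruction,
    nudging the model to separate relevant from irrelevant details."""
    return (
        f"{question}\n\n"
        "Before answering, reread the question carefully and identify "
        "which details are actually needed for the calculation."
    )

# Example with a GSM-NoOp-style distractor (the inflation clause is irrelevant):
question = (
    "Liam wants to buy school supplies at today's prices. "
    "Assuming that due to inflation the prices were 10% cheaper last year, "
    "how much should he pay now?"
)
prompt = build_reread_prompt(question)
```

The wrapped `prompt` would then be sent to the model in place of the raw question.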

### Counting ability [10:37]

So it's clear that this model does reason a little more than usual because of the way the reward modeling was done. And here's another thing this reward modeling has trained the model to do: the paper says they adopted a prompt that many have recently been vibe-testing LLMs with, "how many r's are in strawberry," and in Table 5, only REINFORCE, which is the method they used, can correctly answer it. You can see that the other methods and models clearly fail: GPT-4o says there are two r's, Claude 3.5 Sonnet says there are two r's, and Llama 3.1 405B gets it wrong too, but the model trained with their method managed to count them and get the correct number. So overall, it seems that open-source models have once again raced ahead of closed source in terms of their capabilities, but this leads me to believe that larger frontier models with even better reasoning might be just around the corner, because the last time open source managed to catch up to closed source, we saw a sharp jump in performance across closed-source companies revealing their next iteration

### Future implications [11:52]

of models. So, that being said, if you enjoyed this video, let me know your experiences with Llama 3.1 Nemotron 70B Instruct, and I'd love to know your thoughts.
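As a footnote to the strawberry test above: character-level counting is trivial in code, which is exactly why it makes a good probe for LLMs, since they see subword tokens (roughly "straw" + "berry") rather than individual letters.

```python
# Count occurrences of the letter "r" in "strawberry":
# s-t-r-a-w-b-e-r-r-y has one "r" in "straw" and two in "berry".
word = "strawberry"
r_count = word.count("r")
print(r_count)  # → 3
```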
