# GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=Bs6eyNQjGpo
- **Date:** 19.10.2024
- **Duration:** 37:06
- **Views:** 20,962
- **Source:** https://ekstraktznaniy.ru/video/11897

## Description

This paper (by Apple) questions the mathematical reasoning abilities of current LLMs and designs a synthetic, template-based dataset distribution to investigate various aspects of LLM performance on grade-school-level math questions.

Paper: https://arxiv.org/abs/2410.05229

Abstract:
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hello there. Today we're going to take a brief look at "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models". This paper is out of Apple, which continues to slowly but surely enter the research arena. With personnel acquisitions like Samy Bengio, who was previously at Google, this was to be expected, but it's still very cool that more companies are coming out and doing research in this field, especially now that the traditional players such as OpenAI and Google are moving away from open research and only releasing technical reports, a.k.a. advertisements for their new APIs.

So what does this paper do? In short, it questions whether reasoning is happening in large language models, especially as it pertains to mathematical reasoning. They also ask whether the current benchmarks, notably the GSM8K benchmark, are part of the training sets of a lot of these models, because some of their experimental evidence suggests the models might already have prior knowledge of the questions. And lastly, they investigate how robust LLMs are on this mathematical task: do they really understand these things, or do they just do pattern matching? The conclusion the paper comes to is, of course, that no, the LLMs aren't reasoning, they are just quote-unquote pattern matching, and there is probably also a considerable amount of test-set contamination of the GSM8K benchmark. Therefore new benchmarks are needed, and they provide one: GSM-Symbolic, an additional dataset, or rather a methodology for creating synthetic data, that is supposed to prevent this type of test-set contamination. So a lot of stuff packed into one paper; nevertheless it's quite a short one, and I expect we won't spend too much time on it.

The problems I have with this paper are twofold. On one hand, I believe their construction of the GSM-Symbolic dataset is not without problems, and we'll discuss that; some of their conclusions I would put a question mark behind, just because of how they constructed their dataset, since I don't agree with the assumptions behind it. Secondly, there's their whole framing of reasoning as such. It's totally fair to publish research that shows the weaknesses of LLMs on a given task, and maybe even why those weaknesses appear, as this paper does. But to then conclude "it's not reasoning, it's just pattern matching", without defining reasoning, without defining these terms well, is a bit shady. The question you have to ask yourself is: are humans reasoning? You'll first make a joke and say "well, some aren't, haha", but if you actually think about it, you'll probably say yes, the human brain is reasoning in its assessment of problems. Now this is a challenge for the paper, because I would say that if you gave the tasks here to a human, they would probably fail in similar ways. So if the paper concludes that since the LLMs fail in these ways they aren't reasoning and are just pattern matching, I don't know, what does that mean for humans?

Okay, let's get into it. They produce a new dataset, and the dataset is synthetic, made such that you can produce endless variants of the same problem with minor changes, to explore how robust LLMs are to those changes. They take an example from GSM8K, which is a human-created dataset of little grade-school math questions. For example: when Sophie watches her nephew, she gets out a variety of toys; the bin has 31 blocks in it, the bag has eight stuffed animals inside, the tower has nine multicolored rings. So these are questions where you need the four basic operations (addition, subtraction, multiplication and division)

### Segment 2 (05:00 - 10:00) [5:00]

in order to reach the answer. The question is packed into a bit of text, so you have to parse out how the different quantities relate, do the calculations, and then you get an answer. The dataset has the question, the annotated solution process, and the final solution. This is a good dataset for exploring these topics because it's relatively easy to automatically check whether an LLM did the right thing, which in the LLM world is a considerable challenge, since the same answer can be formulated in many different ways; having mathematical tasks makes this a lot easier.

What they do is take a sample from GSM8K and make a template out of it. You can see here they annotate things like names and entities, but also the different numbers: these all get variable names. Then they define how these things can be filled in. For example, a name can be any of a list of names they keep internally; a family relationship likewise. So the person who makes the template provides these options and essentially says what can be filled into the template. For the numbers you can give ranges between which values can be sampled, and then you can have conditions. One of the conditions is always what the solution must be, i.e. the solution must satisfy a given formula, but you could have other conditions so that the problem makes sense. This is the basic structure: take a sample of GSM8K, make a template out of it, annotate the template with the valid values and the conditions on the template parameters, and then you can generate as many variants of it as you want. Note that the text in between the template variables never changes.

They take 100 of these GSM8K samples, make templates, and generate 50 variants of each. Quoting the paper: "we conduct nearly 500 total evaluations, a manageable dataset size, using 100 templates and generating 50 samples per template, resulting in 5,000 total examples for each benchmark." So you end up with 50 datasets of 100 examples each, where each example is a mutation of one of the original 100 samples from GSM8K. Task number one in each dataset is always based on the same GSM8K template, but filled in with different placeholder values, and is therefore effectively a different task.

Now the first question: give all 50 datasets to the LLMs and see how they perform. You'll be either shocked or not shocked, but there are two effects. First, most of the LLMs show a lot of variance in their final performance. You run the 50 datasets, compute the mean score for each, and plot those in a histogram, and the spread is huge: from a score of 70 to a score of 85, a much broader range than the gap between individual LLMs on the leaderboard. Second, the stronger, larger models like GPT-4o have noticeably smaller variance than the smaller
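The templating mechanism described in this segment can be sketched roughly like this. All names, ranges, the answer condition, and the template text below are my own illustration of the idea, not the paper's actual code; variants are drawn by simple rejection sampling until the condition holds:

```python
import random

# Illustrative template with typed placeholders: names/relations come from
# fixed lists, numbers from annotated ranges, and the final answer must
# satisfy a condition (here: a plausible-size range, invented for the sketch).
TEMPLATE = ("When {name} watches her {relation}, she gets out a bin with "
            "{x} stuffed animals and a tower with {y} rings. "
            "How many toys are there in total?")

NAMES = ["Sophie", "John", "Mia"]
RELATIONS = ["nephew", "brother", "cousin"]

def sample_variant(rng: random.Random) -> dict:
    """Rejection-sample placeholder values until the answer condition holds."""
    while True:
        x = rng.randint(5, 100)          # range annotated on the template
        y = rng.randint(5, 100)
        ans = x + y                      # ground-truth solution formula
        if 10 <= ans <= 150:             # the template's answer condition
            return {"name": rng.choice(NAMES),
                    "relation": rng.choice(RELATIONS),
                    "x": x, "y": y, "answer": ans}

def render(values: dict) -> str:
    """Fill the fixed template text with one sampled set of values."""
    return TEMPLATE.format(**{k: v for k, v in values.items() if k != "answer"})

rng = random.Random(0)
variants = [sample_variant(rng) for _ in range(50)]  # 50 variants per template
print(render(variants[0]), "->", variants[0]["answer"])
```

The point of the sketch is that every variant shares the exact same surrounding text; only the placeholder values differ, which is what lets the paper treat GSM8K itself as one draw from this distribution.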

### Segment 3 (10:00 - 15:00) [10:00]

or weaker models, however you want to call them. So that's one thing they point out: the variance is really large, though for the bigger models like GPT-4o it's smaller. I do have something to say about that, but we'll get to it when we talk about the drop in accuracy... actually, let's do that now.

The second thing: the dashed line is always how well the model did on the original dataset, the original GSM8K tasks the templates are derived from. For a lot of models that is quite a bit better than on the dataset variants. That's the first indication for their claim that the models already knew about the dataset; why else would a model be so much better on that one particular variant than on all the others? They essentially say: our dataset is a distribution of datasets, and the original GSM8K is basically a single draw from GSM-Symbolic, so we would expect it to fall somewhere in the middle, but it tends to be on the very right-hand side. Oddly, on this single draw from the distribution, the models perform significantly better than on all the other draws, with a few outliers like GPT-4o.

Then there's a graph showing how much each model drops on average compared to its original GSM8K performance, the GSM8K-to-GSM-Symbolic accuracy drop. Some models, like GPT-4o and o1-mini, have a relatively small drop, but you see larger and larger drops, especially as the models get smaller. Now, I have something to say about this graphic. They make it seem like there's a difference in how much these models drop, and that's true, but as you saw before, GPT-4o's GSM8K baseline performance is already at 95, whereas Gemma 2's baseline is only at 80-something, Phi-3.5's is at maybe 87, and Mistral or Mathstral is at 80. Let's do it really extremely: say model 1 is 99% accurate and model 2 is 10% accurate, and both drop by one percentage point. Model 1 goes from 99% to 98% accuracy, model 2 from 10% to 9%. Now look at it the other way around, at the error: model 1's error doubles, from 1% to 2%, while model 2's error goes from 90% to 91%, barely a 1% relative increase. Models that start at a much higher baseline should have their accuracy drop normalized by how much error they were already making; looking at the error and normalizing by that helps a lot more than looking at the raw score. And then I don't know if you can make much sense out of this graph: if you rescale it by the error, they could all be dropping by relatively constant amounts. Likewise, if you rescale the variance by the baseline error the models make, then you could just
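The normalization argument can be made concrete with a few lines of arithmetic. The two baselines below are the hypothetical 99% and 10% models from the discussion, not measured values:

```python
def drop_stats(baseline_acc: float, new_acc: float) -> dict:
    """Compare the raw accuracy drop with the relative increase in error."""
    base_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return {
        "acc_drop_points": (baseline_acc - new_acc) * 100,
        "error_growth_factor": new_err / base_err,  # how much the error multiplies
    }

strong = drop_stats(0.99, 0.98)  # strong model: 99% -> 98% accuracy
weak   = drop_stats(0.10, 0.09)  # weak model:   10% ->  9% accuracy

print(strong)  # same 1-point drop, but the error doubles (1% -> 2%)
print(weak)    # same 1-point drop, error grows ~1.01x (90% -> 91%)
```

Both models lose exactly one accuracy point, yet the strong model's error doubles while the weak model's error barely moves, which is why the same bar height in the paper's drop chart can mean very different things.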

### Segment 4 (15:00 - 20:00) [15:00]

find that all of them exhibit about the same variance, normalized by where they are on the error scale. There's no big point here; I just wanted to note that graphs like this, where you look at the relative performance of models that start at different points, can lead you to different conclusions depending on how you scale and what your comparison point is. Keep that in mind.

That being said, they say: since all of these models drop, it could be that the test set is already part of the training data of these models. I have maybe a slightly different hypothesis, and the two are not exclusive; both could contribute. If you look at how they construct their dataset, I want to challenge their claim that the original dataset is just one draw from their distribution of datasets. I would at least slightly disagree, because the original dataset was made by humans, and humans, when they write little math exercises, will naturally pick numbers that make sense, both relative to each other and in the real world, and maybe also so the arithmetic is a bit nice to compute. Especially the first two: if the exercise says the electricity bill is this high and a kilowatt-hour of power costs this much, the numbers humans choose make sense in the real world. In the example here, the bin of stuffed animals has X animals inside, the tower of stacking rings has X multicolored rings on it, and the annotated ranges go up to 100. Which stacking tower has 100 rings on it? Hopefully I'm not nitpicking, and maybe this is just one example, but if you just sample from these ranges, the relations between the values are also completely unconstrained by the conditions, so you can end up with questions where you'd go: really? Why is someone buying 3,000 liters of milk to go with one box of cereal?

Why am I saying this? Because LLMs aren't just mindless calculators. LLMs are trained on largely human-produced text; they are trained to predict next tokens in a world of text that is largely from humans, for humans, and largely describes the real world. So they will be more comfortable in, let's say, real-world circumstances where things quote-unquote make sense. I hypothesize that at least some of the drop in performance comes from the fact that the template-generated data isn't from the same distribution as the original GSM8K, but from a slightly different distribution where illogical, world-ill-fitting scenarios are as common as logical, world-fitting ones, whereas in the original dataset the scenarios are naturally world-fitting because humans produced them. I hope that brings the point across. The other thing is that these GSM-Symbolic templates are only half automated: they're generated semi-automatically and then checked by humans, and also checked by models; if fewer than two models pass them, they're checked again, and so on. So it's a half-automated process to even come up with the dataset, and then sampling from it is fully automated. All right, let's go on.
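To make the distribution-mismatch hypothesis concrete, here is a toy sketch. The ranges and the plausibility rule are invented for illustration: uniform sampling over annotated ranges treats world-implausible combinations (3,000 liters of milk for one box of cereal) the same as plausible ones, while human-written originals are implicitly plausibility-filtered:

```python
import random

def sample_uniform(rng: random.Random) -> dict:
    """Naive template filling: draw each number independently from its range."""
    return {"cereal_boxes": rng.randint(1, 5),
            "milk_liters": rng.randint(1, 3000)}

def is_world_plausible(q: dict) -> bool:
    """Toy plausibility rule: at most ~2 liters of milk per box of cereal."""
    return q["milk_liters"] <= 2 * q["cereal_boxes"]

rng = random.Random(0)
draws = [sample_uniform(rng) for _ in range(10_000)]
plausible = sum(is_world_plausible(q) for q in draws) / len(draws)
print(f"{plausible:.1%} of uniform draws are world-plausible")
```

Under this (made-up) rule, only a tiny fraction of uniform draws land in the plausible region, which is the sense in which the synthetic distribution could differ from the human-authored one even though every draw is "valid" per the template conditions.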

### Segment 5 (20:00 - 25:00) [20:00]

They do an interesting experiment where they say: we just sampled all of these placeholders, but we have distinct kinds of placeholders. Some are numbers: I can change the number of stuffed animals, the number of rings on the tower, and so on. Others are names: instead of Sophie it's John, instead of Sophie's nephew it's Sophie's brother. What if I separate those and study them individually? It turns out that if you only change the names, the language models tend not to drop in accuracy; in fact, some of them even beat their baseline performance. If you just change the names (the green bumps), everything is kind of fine; you still have relatively large variance, though not as much. However, if you change the numbers (the blue bumps), performance drops significantly.

This could be an indication that, yes, the models have seen the test set in their training data and are just recalling: "oh, this is the one about the rings and the stuffed animals, and I recall the answer was 14", so if you only change the names they still get it right. Or it could be that changing the numbers makes some of these samples illogical, as I hypothesized, and that's part of the reason. Something that does support my hypothesis a little is the variance: if this were really a function of remembering, I don't think the variance would blow up so much; you'd maybe see the same bump but lower. Although, to be fair, you see the same pattern here: the lower these bumps go, the bigger the variance, and if you rescale by the baseline error rate it's not clear the variance would still increase; in the rescaled version these bumps might all be about the same. But even in this scaling, if it were pure remembering, you'd expect the bump to just shift, because the model would keep giving the same memorized answer... no, that doesn't quite work either, because the distribution is over how much the model got correct. It could be either; I'm going nowhere with this, sorry, but I hope you can see that interpreting these graphs is not as straightforward as it looks, and the paper just picks one view.

All right. Next they say: we can now change the dataset to make the questions easier or harder, and not by using weirder numbers, but by removing or adding conditions, i.e. elements of the question, so that you have to do fewer or more math operations and consider fewer or more things. Here's an example: this is a call from a phone booth; you pay this much for each minute of your call, and the price drops after 10 minutes; how much would a 60-minute call cost? They can either drop the after-10-minutes price drop, or introduce a new price drop even later (after 25 minutes the price drops even more), and they can do it twice: after 25 minutes the price drops even more, and if your total bill is more than $10 you get a 25% discount. That second bit is a genuinely new kind of condition. So this is minus-one-condition (easier), plus-one-condition (harder), plus-two-conditions (hardest). The question is how the models perform as you make things easier and harder. Let's look at the results first: predictably, stuff gets worse. And they kind of argue that if the

### Segment 6 (25:00 - 30:00) [25:00]

models really understood, really reasoned, then it shouldn't matter how many conditions you add; you just map them to plus and minus and you'll be fine. So if these were actually reasoning models, this drop in accuracy wouldn't happen. They again point out that the variances go up as you go to the left, but again, I feel like rescaling by error percentage could negate that outright.

Two things I do find interesting. First, for some models, making the questions one level harder doesn't really affect them, but making them two levels harder does, and that could be a property of dataset construction. You have to put some thought into how you add these difficulty levels, and I believe this particular example is a good illustration of how that can quote-unquote go wrong. If you're tasked with making the question harder and you look at "after 10 minutes the price drops by this much per minute", you'll naturally think: let me introduce another price drop after more minutes. That's the added difficulty. Then someone asks for yet another difficulty level, and as a human you think: well, I've already done the price drop, let me think of something else, and you introduce "if your bill is more than $10, you get a 25% discount". So it could be that, just by the way this dataset was made, the plus-two level isn't simply twice as hard; it nudges the dataset makers toward fundamentally different kinds of problems. Here you have to do a comparison and a percentage discount, versus just continuing concepts already present in the question. So plus-one and plus-two may not be "one harder, two harder", but rather "two harder is often of a different nature", which would explain why for some models plus-one barely drops accuracy while plus-two does.

Second, in this territory you are at the limit of what a regular human can do. Give the hardest versions to a random person on the street: I'm telling you, a lot of them will have trouble, especially if they have to do it in their head (the LLMs don't get help either), and especially under a bit of time pressure. I don't think humans would score 100% here. So again: are humans reasoning? If you conclude that this is a good test of reasoning, because merely adding conditions shouldn't change anything, then I'm sorry, but humans aren't reasoning either, and we should write big papers about how humans aren't reasoning. I don't get why we expect the LLMs to do this; in this case they seem very human to me.

Then there's the last experiment, where they introduce no-ops: the GSM-NoOp dataset. They insert random facts that have no effect on the answer. For example: you pick this many kiwis, then double the kiwis, and then you add "five of the kiwis were a bit smaller than average". A lot of the models trip on that and say: "five of the kiwis were smaller, so we need to subtract them from the Sunday total"; even o1 or Llama here subtract five from the total number of kiwis because they're a bit smaller. So they find that the models don't do well on this no-op dataset; they drop in performance even if you use different kinds of multi-shot Chain
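The phone-booth example from the difficulty experiment above can be written out as straightforward arithmetic. The per-minute rates here are made up, not the paper's exact numbers; the point is that each added clause is "just one more plus or minus" once the question is parsed:

```python
def call_cost(minutes: int, clauses: int) -> float:
    """Cost of a phone-booth call at three difficulty levels:
       clauses = 0: base question (price drop after 10 minutes),
       clauses = 1: plus-one (second price drop after 25 minutes),
       clauses = 2: plus-two (additionally, 25% discount on bills over $10)."""
    cost = 0.0
    for m in range(minutes):
        rate = 0.60                      # base per-minute rate (invented)
        if m >= 10:
            rate = 0.50                  # first price drop
        if clauses >= 1 and m >= 25:
            rate = 0.30                  # second price drop (plus-one clause)
        cost += rate
    if clauses >= 2 and cost > 10.0:
        cost *= 0.75                     # 25% discount (plus-two clause)
    return round(cost, 2)

for c in range(3):
    print(f"{c} extra clause(s): 60-minute call costs ${call_cost(60, c):.2f}")
```

Each extra clause is one more conditional branch over the same four basic operations, which is the sense in which the paper argues a reasoning model shouldn't degrade much, while humans plausibly would.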

### Segment 7 (30:00 - 35:00) [30:00]

of Thought. By the way, all of these experiments use eight-shot Chain of Thought, as is kind of the standard for this dataset, so the models get examples in the context. Here, even if you give them examples of the exact same question with the no-ops in, just with changed numbers, a lot of them cannot solve it, even with eight demonstrations of how the no-op is successfully ignored. These are the conditions where they give explicit examples with no-ops, sometimes even the same question with different numbers, so you have eight demonstrations of how to ignore the irrelevant clause, and most of the models aren't good at picking that up and still fail noticeably. Not all of them: some models, like Gemma 2B, can do it if you explicitly show them Chain-of-Thought examples where this particular piece of irrelevant information is ignored, but that's kind of cheating. Overall, all the models drop.

This is an interesting bit. First of all, again, if you gave this to a human and inserted irrelevant information like this, a lot of humans would try to use it somehow, to get it into their answer: "what do I do with this, do I need to subtract it?" If you demonstrated to them eight times how to ignore it, then maybe not, but you'd probably have to explicitly point out "I am ignoring this because it doesn't matter". I bet that if you give this to people on the street, at least half of them will do something with the five kiwis that were smaller than average, I promise. And so you come back to the question: are humans reasoning? If yes, then you have a problem with the conclusions of this paper, because most humans wouldn't solve these problems either. If no, then what is reasoning, how do we define it, and why do we even care whether LLMs do it, since they're doing the same things as humans? And what is it then about: expecting LLMs to be calculators, to be formally deriving computers? That's not the point; we have computers, we have programming languages, we have regexes and parsers and all of that if you want it. So in my mind, I don't know, people love to complain that "it's not reasoning, it's just pattern matching", but I don't see the point.

Another interesting thing was pointed out by a member of our Discord when we discussed this paper (by the way, we have lots of paper discussions on Discord, everyone is welcome, and if you want to present a paper you find interesting, that's the place). Someone pointed out that not only did they not really show that LLMs can't reason, because they never define "reason", but what they did kind of show is that LLMs are really bad at pattern matching, even though their point is "they only do pattern matching". No, they don't: even if you give them eight demonstrations of how to ignore this particular piece of irrelevant information in their context, a lot of them just can't do it. And what better demonstration do you need that LLMs suck at pattern matching? Now, obviously what happened is that their training data makes them want to consider this extra information, and no amount of patterns in their context overrides that in this case, but still, they suck at pattern matching.

So where does that leave us? That leaves us with the paper's conclusions, which again "expose a critical flaw in the LLMs' ability to genuinely understand mathematical concepts and discern relevant information for problem-solving". Yes, but also "the limitations of the ability of LLMs to perform genuine mathematical

### Segment 8 (35:00 - 37:00) [35:00]

reasoning". But what about humans? "The high variance in LLM performance on different versions of the same question, their substantial drop in performance with a minor increase in difficulty, and their sensitivity to inconsequential information indicate that..." Before you continue: all of these are properties of humans as well. "...their reasoning is fragile; it may resemble sophisticated pattern matching more than true logical reasoning." Right. They remind us that both GSM8K and GSM-Symbolic include relatively simple grade-school math questions, requiring only basic arithmetic operations at each step, so the limitations of these models are likely to be more pronounced on more challenging mathematical benchmarks, and they call for developing AI models capable of more formal reasoning. That is where, I don't know: we have formal reasoning engines. Why do we need LLMs to do that? Humans suck at it. Aren't LLMs supposed to be doing the things that humans are good at but machines were thus far bad at? "As we strive to create systems with human-like cognitive abilities or general intelligence"... that's the disconnect, right? I don't know why people assume these results are any indication about human-like cognitive abilities, because it seems to me the LLMs are much more like humans, as demonstrated by this paper. That's what it seems to me.

All right, enough of the rant. I read the paper in full; the experiments are good, they have full reports in the appendix and whatnot. I just feel their conclusions are a little bit... well, you can come to different conclusions based on the same data. That's it. Tell me what you think in the comments. I'll see you around. Bye-bye.
