# Scalable MatMul-free Language Modeling (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=B45FlSQ8ITo
- **Date:** 08.07.2024
- **Duration:** 49:45
- **Views:** 35,111

## Description

Matrix multiplications (MatMuls) are pervasive throughout modern machine learning architectures. However, they are also very resource intensive and require special accelerators (GPUs). This paper explores architectures that do away with MatMuls and use quantization and recurrence to keep performance up.

OUTLINE:
0:00 - Intro
2:30 - MatMul is everywhere
5:55 - Ternary accumulation as a substitute for matrix multiplication
16:35 - Replacing attention layers with recurrent layers
32:40 - Replacing dense layers with ternary channel mixing
38:30 - Language modelling results & scaling laws
45:00 - Other experimental results
48:20 - Conclusion

Paper: https://arxiv.org/abs/2406.02528
Code: https://github.com/ridgerchu/matmulfreellm

Abstract:
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.

Authors: Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian



## Contents

### [0:00](https://www.youtube.com/watch?v=B45FlSQ8ITo) Intro

Hello everyone, today we're going to look at "Scalable MatMul-free Language Modeling" by researchers from UC Santa Cruz, Soochow University, UC Davis, and LuxiTech. This is a paper that replaces every single matrix multiplication operation inside of large language models with something that is more efficient from a computational standpoint. Specifically, they replace the feed-forward layers in Transformers with ternary accumulators, and they replace the attention part of Transformers with a form of ternary recurrent network that is nevertheless parallelizable. We're going to look into these things. The paper essentially draws together ideas from other papers, such as BitNet and RWKV and new-age RNNs like that, and pulls them together to produce large language models that are completely matrix-multiplication free and therefore promise to be a lot more hardware efficient than current large language models.

They do train these models, and their scaling laws indicate that there is a crossover point: for a given amount of FLOPs — or whatever you call the operations at this point — for a given amount of compute invested, these models will eventually become more efficient and better than current language models. That's an exciting outlook; however, it is a projection, and I do believe it's worth (a) remaining skeptical and (b) recognizing that there are some roadblocks in the way of models like these. Notably, the hardware lottery is playing heavily against them, which is why the authors also implement a variant on an FPGA, essentially saying: what if we could build custom hardware for this type of model? Then you could make use of all of these benefits. But right now we have GPUs. So let's dive in directly. They say matrix multiplication is obviously the

### [2:30](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=150s) MatMul is everywhere

dominant operation in most neural networks nowadays — whether you're doing ConvNets, RNNs, dense MLPs, or, as in most cases nowadays, Transformers — matrix multiplication is everywhere. In the simplest case, a Transformer is a stack of two types of layers. You have some sort of signal x that first goes into a multi-head attention block (not necessarily self-attention, just multi-head attention). That multi-head attention essentially looks like this: we take x and calculate three different matrices, Q = x·W_Q, K = x·W_K, and V = x·W_V. We then take Q and K, calculate the outer product, which gives us a big attention matrix that goes through a softmax operation; let's call the result A′. Lastly, we accumulate V according to A′, and that becomes the output of the layer — something like this. We've discussed Transformers and the attention mechanism multiple times already. You can see that notably there are lots of matrix multiplications involved. This last accumulation step is not really a matrix multiplication, because for a single token A′ just produces a kind of weighted sum, but if you do it across tokens you can interpret it as a matrix multiplication too. So all of these things are matrix-multiplication heavy.

The linear layer is not much different: you take your signal x and multiply it by a big weight matrix, which gives you an output y. You can do that in various ways, with up-projections and whatnot, but here too you have heavy matrix multiplications involved. These are costly; they need big accelerator chips. What if we could get rid of them? That's what this paper asks.

The main staple of this paper is an idea that comes from a paper called BitNet — I believe that's a Microsoft paper — where they replaced the dense-layer weights with binary and ternary values. The authors here say that BitNet replaces part of the Transformer architecture with heavily quantized weights but retains the self-attention mechanism, which relies on expensive matrix multiplication. So how and why does quantization all of a sudden drop matrix multiplication? Here is where we get into a bit of lingo, and you'll notice I skip through parts of this paper because I want to get to the meat of things. So, a matrix

### [5:55](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=355s) Ternary accumulation as a substitute for matrix multiplication

multiplication is written like this: you have x, which is a vector, and you have W, which is a matrix. For each output dimension across this dimension d, you multiply the corresponding input dimensions with the entries of W — whatever you know as matrix multiplication.

Now, what if we use what are called ternary weights? In a standard matrix multiplication of this matrix times a vector, one output entry is this-times-this plus this-times-this, and so on — a lot of plus and times operations that give you one output over here, and then the same for the second column, the third column, and so on. If you have ternary weights, the values of the matrix cannot be just any floating-point number; each value can only be −1, 0, or 1, and that's it. You restrict — you quantize — the weight matrix to only these three values. What that does is that you no longer have to think of it as a multiplication you carry out. Between two elements you would usually do a floating-point multiplication; however, if you know your weights can only be one of these three values, then it's a simple if-statement, if you will. What would be a scalar multiplication is now just one of three cases — a ternary if-statement. If the weight-matrix entry is zero, the output is zero (the same as with floating-point operations, except a floating-point value is probably never exactly zero). If it's one, you just copy the input over, because times one is the identity. And if it's negative one, you flip the sign. Those are all the cases that exist, and therefore you can reformulate this as just a selection.

You take the column and — if you know numpy's np.where, you know roughly how this goes — you see where the weight is one, say here and here, and where it is negative one, say here and here. All you need to do is select the first set of values and the second set of values, accumulate each set, and subtract one from the other. That's it: no big floating-point multiplications, just a selection of two sets, a cumulative aggregation, and a subtraction. That's what's written down here: the ternary MatMul can be written as follows — wherever the weight matrix is one, we simply add up the corresponding x values. Note that x is still a floating-point vector; not all the values here are ternary, just the weights, which means x can be as high precision as you want. You don't need to multiply with it: you sum all the entries where the weight is 1, sum all the entries where it is −1, and subtract one sum from the other. This makes it a radically simpler operation, and it alleviates the need for floating-point multiplications. As you may know, floating-point multiplication was at one point a very expensive operation, for which CPUs had to build special side accelerators and components, so dropping it could make an entire piece of hardware componentry obsolete, if you will.

Now, I have some worries and comments about this in itself. You'll notice I've just highlighted it in the best possible way: oh, we can reformulate this and it's just a selection operation, and the need for floating-point multiplications drops away. All of this is true. However, it's still kind of the same operation. If these were all floating-point numbers, I could formulate it in a similar way: take all entries where w_ij is larger than zero and accumulate the x_j, then subtract all where it's less than zero — except there I would obviously still have to multiply by the weight, and that effective multiplication is what falls away in the ternary case. Other than that, it's just a rewriting of things. If you were to naively implement this on a GPU, you would probably still go with the multiplication, even though it's heavily quantized; it's only once you go to custom, kernel-level efficient implementations that you actually see any gains. And that's the weakness here: our current models are geared towards our hardware, and our hardware in turn is geared towards the models — a phenomenon usually called the hardware lottery — and breaking out of this is difficult even if you have something theoretically more efficient. So there are essentially two things. One: we can replace scalar multiplication by ternary select operations — do we take the value, not take it, or take the flipped value. The rest — that you can accumulate the selections and then subtract one sum from the other — is just shuffling things around; that's not really making anything more efficient. Also, with this selection you will still have to touch every single value; the only thing that really falls away is the scalar multiplication, which you replace by these ternary select operations.
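Here is a minimal numpy sketch of this selection view of the ternary MatMul (shapes and variable names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                    # full-precision input vector
W = rng.choice([-1, 0, 1], size=(8, 4))   # ternary weight matrix

# Standard matrix multiplication (what a naive GPU kernel would do).
y_matmul = x @ W

# Ternary accumulation: per output column, sum the inputs where the
# weight is +1, sum them where it is -1, and subtract. No multiplies.
y_ternary = np.array([
    x[W[:, j] == 1].sum() - x[W[:, j] == -1].sum()
    for j in range(W.shape[1])
])

assert np.allclose(y_matmul, y_ternary)
```

As discussed above, this is mathematically the same operation — every input value is still touched once per output — so the win only materializes on hardware or kernels where additions are much cheaper than floating-point multiplies.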
They now go into a bit of detail about how BitNet did that. BitNet showed that stabilizing ternary layers requires an additional norm, and in BitNet this was implemented very inefficiently given GPU memory architectures: BitNet had many I/O operations — reading something, writing it back for quantization, reading it again, storing it, and so on. The authors essentially say: look, we have an efficient implementation of that, which is the algorithm they present here. As far as I understand — and I am not a very knowledgeable person when it comes to GPU or accelerator architectures and how to program for them — the whole section says that BitNet had the right idea of replacing the linear layers with these ternary quantized layers, but did it in a very inefficient way, and these authors do a better job of implementing it efficiently. To my understanding, that's not really a conceptual advantage but an engineering one, and a bit of a diss towards the BitNet people.

So that's the first part: they take the ternary-operation idea from BitNet, but implement the linear layer efficiently. However, BitNet also retained the self-attention mechanism from Transformers, and this is the other graphic here: you take the current token and produce three matrices — in this case three vectors — which come about by combining it with the key-value cache. You accumulate keys and queries, which gives you this attention accumulation, which you then multiply together and aggregate, and that goes through a

### [16:35](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=995s) Replacing attention layers with recurrent layers

normalization step. Again, they say BitNet still retains this, and the reason — they've tested it themselves — is that if you just replace the matrix multiplications in the attention mechanism with these ternary operations, it doesn't work: it crashes. You can see here the loss just diverges, goes up, doesn't converge. So even though quantization works elsewhere, if you quantize too much — that's what this experiment says — you cannot train the model. There is something in the world of floating-point numbers: sometimes you actually do need more than three different values to choose from, at least with current attention architectures. The idea of attention — that we construct fast weights and dynamically aggregate — apparently relies a bit on nuance. If I have a bunch of tokens and compute signals from them, the ability to say "I want a little bit of this, a lot of this, and subtract some of that" seems to be important, rather than just "I want this, I don't want this, and this is very bad." That difference seems crucial. My guess — or hope — would be that if you increase the dimensionality, you could maybe make up for that again. In the simplest case you could just duplicate the values, say five times, and then allow the model to choose in a fractional way: with three copies of the same value it could choose it twice but not three times. I'm not sure how you would implement that, and it would essentially be cheating — you'd just be trading off dimensions for quantization — but if we ever have hardware that's super efficient for ternary operations, this could be a way to go in the other direction and effectively "un-quantize" things.

All right, enough babble from me; back to the paper. Ultimately they replace the attention mechanism with a recurrent neural network in the spirit of modern recurrent architectures. What do I mean by that? They take the gated recurrent unit (GRU), a pretty old idea and a standard recurrent architecture. In a recurrent neural network you have some hidden state carried over from the last step, usually called h_{t−1}, and some current signal, in this case the current token x_t. Both go into a box, and out come an output for the current step, o_t, and a new hidden state, which is passed on to the next step together with x_{t+1}, and so on. This is recurrent because all the weights are shared across steps. It also leads to a situation where, if you want to do backpropagation, you need backpropagation through time: you have to push the gradient signal back through all of these paths. That can cause problems, especially because old-school recurrent networks have nonlinearities in the hidden-to-hidden paths. Your gradient flow passes through a couple of nonlinearities between stages, which means you cannot effectively parallelize training: with a nonlinearity in the path, you have to compute one stage, push it through the nonlinearity (either forwards or backwards), compute the next stage, go back through the nonlinearity, and so on. New-age RNNs avoid this by simply making the recurrence linear, and that's approximately what happens here.

So this is the old-school GRU; it's best to read it from the back. In a GRU, the output and the hidden state are equal: the output is the hidden state. How does the hidden state for the next step come about? We take the last hidden state and multiply it by what's called a forget gate: we decide how much of the last hidden state we want to forget. Then — you can see it's one minus the forget gate — we decide how much of the new, so-called candidate hidden state we want to add. So you have the old hidden state and the candidate; you decide how much of the old to forget, and exactly in those places you add from the new candidate. That's what the forget gate does. And how does the candidate hidden state come about? It takes into account the current signal x, transformed by a matrix multiplication, plus a bias vector, plus the old hidden state, which is first transformed by r — and r in turn depends on both the old hidden state and the current input. It's an intricate connection: essentially, you update the hidden state in a way that depends on both the current signal and the last hidden state — a data- and state-dependent update. At every step you look at your current state and your current input, and those two things determine your candidate hidden state for the next step.

This is very powerful — it allows all sorts of dynamic decisions about the update — but it's also highly intricate and highly nonlinear, and therefore leads to these problems where you can't parallelize training: you have to do one step after another, and in the backward pass backpropagate one step after another. You can't parallelize over the time dimension like we can in a Transformer. So they are going to linearize this, and what's especially important to linearize is the hidden-to-hidden connection. In their updated version, they first introduce an output gate: the output is no longer just equal to the hidden state but a transformation of it, and that transformation depends on the current signal. They say they're inspired by other new-age recurrent networks here. With these kinds of additions I'm never really sure whether they were there from the beginning — technically they're not needed to implement the core idea — or whether they were added after things didn't work out as well as hoped. That's a bit of trouble I have with a lot of papers in general. There's always the new idea — here, can we replace matrix multiplications? — and from a scientific perspective you could argue you shouldn't do anything else: just replace those and see the effect. Instead, a lot of papers introduce many tricks that have nothing to do with the core idea, presumably because they improve performance; but then you don't know how much of the improved performance is due to the tricks and how much to the actual idea. On the other hand, you could argue that current architectures and everything we currently do are so optimized for the self-attention mechanism that it's only fair to employ every trick you can to make the new idea as performant as possible, because only then can you compare apples to apples: two stacks based on different ideas, each individually optimized to perfection. What's the maximum we can get out of this idea versus that idea if we implement all the tricks? I don't know the correct answer; usually ablations are good answers here. If I see something like this, I think at least part of the performance is due to these kinds of things.

In any case, they do that, inspired by these new-age RNNs. Also crucially — you can see this operation here is again a ternary operation — while replacing the self-attention matrix multiplications with ternary operations doesn't work, once they change the recurrent network to this more linearized architecture, it does work to replace the matrix multiplications with ternary operations. You can see in many places these ternary and pointwise operations, which means there are no matrix multiplications left. Notably, the new hidden state is again a function of the old hidden state and the candidate state — nothing much has changed there — but the candidate state is no longer dependent on the old hidden state. How you update the hidden state no longer depends on the last hidden state: you make only data-dependent decisions about the update, not state-dependent ones.

We have seen a number of architectures recently trade away data or state dependence for linearization and scalability, and I think that's the crux here. The question is: what are the kinds of situations where you would need a state-dependent update of the hidden state? I would argue, in a lot of places you probably do — how you update the current state should depend quite a bit on the current state. If you think of a model that needs to perform inference across a sequence of tokens, it seems wrong that the update of the hidden state depends only on the current token and not at all on the current hidden state; yet that's what this paper suggests. State-space models go even further: they don't do anything data-dependent at all, they just fix how the past is accumulated no matter what the past is. You can probably get quite far that way, but I would argue the beauty of Transformers is that all these operations are fully data-dependent, and that data dependency — especially once you go large — lets you integrate so much information into your operations that you can start doing very complicated things; your statistical modeling capability is just larger. That's a personal opinion of mine; I have no data to back it up. I just feel that dropping the state dependence of producing the candidate hidden state is going to be a major drawback, even though in the experiments in this particular paper it doesn't make that much of a difference — in fact they're even hopeful about the future. I'm a bit more skeptical, frankly on the basis of intuition, so I might be totally wrong.

Okay. So this modified GRU, with the linear relationship between hidden states — no state-dependent hidden-state update, which would make it nonlinear — is parallelizable: if every update depends only on the current data, I can calculate all of the updates for a training sequence in parallel, and then backpropagate through all of them in parallel. That gives the training scalability that Transformers, with their causal attention mask and layer-by-layer computation, have. So this is one part; the other part is the channel mixer.
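To see why the data-dependent (rather than state-dependent) update is what buys parallelism, here is a small numpy sketch. The gate and candidate values are random stand-ins for the paper's input-dependent projections, not its actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 3
# In the MatMul-free token mixer, the forget gate f_t and candidate c_t
# are functions of the input x_t only, never of h_{t-1}.
f = rng.uniform(0.1, 0.9, size=(T, d))   # stand-in gate values
c = rng.normal(size=(T, d))              # stand-in candidate states

# Sequential recurrence: h_t = f_t * h_{t-1} + (1 - f_t) * c_t, h_0 = 0
h = np.zeros(d)
for t in range(T):
    h = f[t] * h + (1 - f[t]) * c[t]
h_sequential = h

# Because nothing feeds h_{t-1} through a nonlinearity, the recurrence
# unrolls into a weighted sum that a parallel prefix scan can evaluate:
#   h_T = sum_t (prod_{u > t} f_u) * (1 - f_t) * c_t
rev_cumprod = np.cumprod(f[::-1], axis=0)[::-1]      # prod_{u >= t} f_u
tail = np.vstack([rev_cumprod[1:], np.ones((1, d))]) # prod_{u > t} f_u
h_parallel = (tail * (1 - f) * c).sum(axis=0)

assert np.allclose(h_sequential, h_parallel)
```

A real implementation would compute `tail` with a parallel prefix scan over the time dimension; `cumprod` plays that role here sequentially. If `f` or `c` depended on the previous hidden state (as in the classic GRU), this unrolling would not exist and each step would have to wait for the last.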

### [32:40](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=1960s) Replacing dense layers with ternary channel mixing

To be a bit clearer: they say the Transformer essentially has two parts. One is a token mixer — with attention we integrate information across different tokens, pulling some of this token and some of that token together; that's the attention mechanism. The other is a channel mixer: we take a signal — say x is a vector — and we need some operation that produces a new vector where each dimension can incorporate information from multiple dimensions of the previous signal. Not across tokens, but within a token, across dimensions. Usually that operation is a matrix multiplication, which does exactly that: it mixes channels. But no matrix multiplications here. What they do for the channel mixer is much more straightforward: they simply replace the matrix multiplications with these ternary operations. It takes the form of what's called a GLU — a gated linear unit: you take the signal x and produce two different values from it, which they call g and u; g you additionally put through a nonlinearity, in their case a SiLU, giving something like g′; you multiply g′ and u together element-wise; and then you transform the result again into the output y. The unlabeled arrows here would technically be matrix multiplications — you up-project and then down-project again — except in this case they replace them with ternary operations. So in principle, even though it's a bit more involved, you can view this as the feed-forward layer of a Transformer where the matrix multiplications have been replaced by ternary operations. Not super accurate, but as a principle it's viable, I think. The channel mixer, they say, consists only of dense layers, which are replaced with ternary accumulation operators — and that's it.

So the overall picture: instead of attention, they take this recurrent architecture, make it linearizable and parallelizable, at which point they can replace the matrix multiplications in there with ternary operations — because doing that directly to attention makes everything crash and the model can't train. And secondly, they replace the feed-forward part of the usual Transformer with simple dense layers where the matrix multiplications are also replaced by ternary operations. The particular GLU they're using is also found in modern architectures like LLaMA, Mistral, and RWKV. And that's their model: a whole architecture with no more matrix multiplications — they've all been replaced by ternary operations.

There are a few training details here that I find interesting. Obviously, if you quantize this heavily, gradients don't flow well through the quantization. So they use the straight-through estimator, where you essentially have two streams: one forward signal goes through the quantization — call it q — while you keep the non-quantized signal around, and when you backpropagate, you pretend the quantization didn't happen. That's also not a very accurate description, but it's what we usually call the straight-through estimator; it lets you estimate the gradient through some of these non-differentiable operations. The second thing is that the learning rate needs to be higher: they say it is common practice to employ a larger learning rate when training binary- or ternary-weight language models.
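The straight-through trick can be sketched in a few lines of numpy. The absmean-style ternary quantizer below follows the BitNet b1.58 recipe as I understand it; the function names are mine, and in an autograd framework the same thing is usually written as `w + (quantize(w) - w).detach()`:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    # Absmean quantization in the spirit of BitNet b1.58: scale by the
    # mean absolute value, then round each weight into {-1, 0, +1}.
    gamma = np.abs(w).mean() + eps
    return np.clip(np.round(w / gamma), -1, 1)

def forward_ste(w, x):
    # Forward pass uses the quantized weights...
    w_q = ternary_quantize(w)
    return x @ w_q

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 2))
x = rng.normal(size=4)

y = forward_ste(w, x)

# ...but the backward pass pretends the quantizer was the identity:
# under STE, grad wrt w is computed as if y = x @ w, i.e. outer(x, dy).
dy = np.ones(2)                 # pretend upstream gradient
grad_w_ste = np.outer(x, dy)    # identity-through-quantizer gradient
```

The true gradient of the rounding step is zero almost everywhere, which is exactly why the "pretend it's the identity" estimate is needed to train these models at all.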
interesting lastly they say that the training Dynamics differ from those of conventional Transformers necessitating a different learning strategy uh so they do a cosign learning rate schedule but reduce the learning rate by half Midway through the training process so they need to start out with a relatively large learning rate go with the cosign Decay that's common in current llm training but then reduce it more drastically um halfway
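As a rough sketch of the channel mixer described above, here is a minimal numpy version of a GLU whose projections use ternary weights. The threshold-based ternarization, layer sizes, and names below are my own illustrative choices, not the paper's exact scheme (the paper uses a BitNet-style absmean quantizer; the simple threshold rule here is just a stand-in):

```python
import numpy as np

def silu(x):
    # SiLU / swish nonlinearity: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def ternarize(w, thresh=0.5):
    # Illustrative ternarization: snap each weight to {-1, 0, +1}.
    # (A stand-in for the paper's absmean scheme.)
    return np.where(w > thresh, 1.0, np.where(w < -thresh, -1.0, 0.0))

def glu_channel_mixer(x, W_g, W_u, W_d):
    # All three projections use ternary weights, so each "matmul" is
    # really just additions and subtractions of input entries.
    g = x @ ternarize(W_g)      # gate path
    u = x @ ternarize(W_u)      # value path
    h = silu(g) * u             # element-wise gating
    return h @ ternarize(W_d)   # down-projection back to model dim

rng = np.random.default_rng(0)
d, d_hidden = 8, 16
x = rng.standard_normal(d)
W_g = rng.standard_normal((d, d_hidden))
W_u = rng.standard_normal((d, d_hidden))
W_d = rng.standard_normal((d_hidden, d))
y = glu_channel_mixer(x, W_g, W_u, W_d)
```

Because every weight is -1, 0, or +1, the dot products reduce to signed accumulation, which is the whole point of the architecture.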
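The straight-through estimator mentioned above can be sketched in a few lines: the forward pass uses the quantized weights, while the backward pass pretends the quantizer was the identity, so the upstream gradient lands on the full-precision weights unchanged. The threshold quantizer and all numbers here are my own toy example:

```python
import numpy as np

def ternarize(w, thresh=0.5):
    # Snap weights to {-1, 0, +1}; a stand-in quantizer for illustration.
    return np.where(w > thresh, 1.0, np.where(w < -thresh, -1.0, 0.0))

w = np.array([0.9, -0.2, -1.3])   # full-precision "shadow" weights
q = ternarize(w)                  # the forward pass sees only {-1, 0, +1}

# Suppose backprop hands us dL/dq. The true gradient of ternarize() is
# zero almost everywhere; the straight-through estimator instead treats
# dq/dw as 1 and copies the gradient straight onto w.
grad_q = np.array([0.1, 0.4, -0.3])
grad_w = grad_q.copy()            # STE: dL/dw := dL/dq

w = w - 0.1 * grad_w              # update the full-precision weights
```

In autograd frameworks the same trick is usually written as `w + stop_gradient(quantize(w) - w)`, which evaluates to the quantized value forward while the gradient flows through `w` alone.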
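The schedule they describe, cosine decay with an extra halving of the learning rate midway through training, can be sketched like this. The base and minimum rates are placeholder values, not the paper's:

```python
import math

def lr_at(step, total_steps, base_lr=1.5e-3, min_lr=1.5e-4):
    # Placeholder rates; the paper's actual values may differ.
    # Standard cosine decay from base_lr down to min_lr...
    cos = min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))
    # ...with the twist described above: halve the rate from the midpoint on.
    return cos if step < total_steps // 2 else 0.5 * cos

start = lr_at(0, 1000)     # full base rate at step 0
before = lr_at(499, 1000)  # just before the midpoint
after = lr_at(500, 1000)   # drops to roughly half of `before`
```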

### [38:30](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=2310s) Language modelling results & scaling laws

So these are some of the results they achieve. You can always see the comparison between their MatMul-free LM and Transformer++, which is the reference implementation they compare against. There are a couple of interesting things here. First, let me draw your attention to the scaling law curves at the bottom: this is the scaling law curve for the MatMul-free models, and this one is for classic Transformers. All of these curves are the actual experimental curves, and the last points are where they draw the fitted lines through. The way they draw the lines, the MatMul-free ones sit above the classic Transformers, meaning they are worse. However, if you connect the dots, the slope for the MatMul-free models is steeper than the slope for the Transformers, and therefore there's a projected crossover point at around 10^23 FLOPs, which is about the compute that current LLaMA models use for training. So if you believe this trend continues, once you get past that point it's actually going to be more efficient to put your compute into MatMul-free models than into classic Transformers: you get more bang for your buck, or for your FLOP. Those are projections, however, and they are based on three experiments: the 370 million, the 1.3 billion, and the 2.7 billion parameter models. So there are a couple of issues here. Issue one, and this is probably a property of the learning schedules: you can see that the MatMul-free models always drop faster in loss at first, but then get overtaken by the Transformers at some point; this is consistently the case. You can also see that the gap between the classic and the MatMul-free models gets smaller and smaller as we go up in scale, which is what makes the slopes different. However, it's three dots, and extrapolating on a log scale from three dots seems a bit wonky to me, especially given the size of the differences we're talking about. Plus, if we go with the earlier hypothesis that the simplifications they had to make to get this architecture to work will mainly hurt on more complex tasks (for example, removing the state-dependent update to the hidden state will mainly hurt when you have to do more complex inference across sequences, which is mostly learned at larger scales, since the larger the scale, the more complex the interactions that can be learned), then it's not at all clear that these are perfectly straight lines, or that they would also be straight lines for MatMul-free models. In fact, we don't know that they're straight lines at all; when we draw the plots we assume they are. Maybe they're actually slightly curved, and maybe the curve is more pronounced for the MatMul-free models, not because they're MatMul-free, but just because they're a simpler architecture whose ceiling is different. If you think about it, loss cannot go down infinitely. Maybe it's not feasible, in terms of humanity producing the resources, to go way out there on the compute axis, but we can all agree that the line cannot go down forever. Okay, so the line cannot be a line; it must flatten out at some point. So in my opinion these aren't actually lines, they're ever so slightly curved, because there must be a ceiling: we cannot go below zero loss. And now, is that ceiling at exactly zero? I don't believe so. The second question: is it the same for every model architecture? I also believe not. There are clearly architectures that, no matter how large you go, will never reach it, because they don't have the inherent capability. So couldn't it be that instead of lines with a crossover point, these are slightly curved, and the MatMul-free models will always be slightly worse than classic Transformers? What that still doesn't exclude is that they come close enough to be useful: even if the curves never cross, they could come close enough to be useful. So that's my rant about scaling laws in this case. I can see what they're saying, and it is very exciting, especially that the gaps shrink as the models get larger. However, I'm kind of skeptical of the inference that there's a crossing point and that these models will actually become more efficient. I don't know, that's just my opinion.
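To make the extrapolation concern above concrete: the crossover point comes from fitting straight lines to log-loss versus log-compute and intersecting them. The coefficients below are invented for illustration (a line that starts higher but decays faster, loosely shaped like the paper's figure, NOT the paper's measurements); the point is that a small error in a slope fitted from three points moves the predicted intersection by orders of magnitude:

```python
# Hypothetical log10(compute) -> log10(loss) lines; coefficients invented.
b_tf, a_tf = -0.050, 2.00   # "Transformer++": slope, intercept
b_mf, a_mf = -0.063, 2.30   # "MatMul-free": steeper slope, higher intercept

def crossover(a1, b1, a2, b2):
    # Solve a1 + b1*x = a2 + b2*x for x = log10(FLOPs) at the intersection.
    return (a2 - a1) / (b1 - b2)

x = crossover(a_tf, b_tf, a_mf, b_mf)  # ~23 here, i.e. around 10^23 FLOPs

# A 5% error in just one fitted slope shifts the predicted crossover
# by several orders of magnitude in compute:
x_perturbed = crossover(a_tf, b_tf, a_mf, b_mf * 1.05)
```

The crossover depends on the *difference* of two nearly equal slopes, which is exactly the quantity a three-point fit estimates worst.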

### [45:00](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=2700s) Other experimental results

All right, if you look at the results, they're actually pretty okay. With a lot of these newer papers the gist is essentially: hey, look, we are better in some auxiliary way, in terms of scalability or energy efficiency, in this case in terms of the operations required, because we drop these matrix multiplications, and we are roughly comparable in performance to MatMul-containing Transformers. And the larger you go, the smaller that difference becomes. One more thing to say here: I'm not sure this difference should be read at face value, because the more performance you already have, the harder each additional point of performance becomes. It's a lot easier to go from 42.6 to 45.0 than it is to go from 58.5 to 59.7. So just saying the performance gap gets smaller is accurate, but it isn't entirely what you might imagine it to be. It could very well be that, measured by how much you would need to invest to bridge the gap, the larger models are effectively closer together than the smaller ones. That's just another thought on interpreting some of these numbers. In any case, we should recognize that these models do perform quite well, and given that they don't contain a single matrix multiplication, that is quite remarkable. And then there are the hardware optimizations this brings: a lot less RAM usage, for example, and much shorter latency. So again, even if they never achieve full Transformer-equivalent performance, they could still be super useful, because we can run them on edge devices, use less RAM for them, and run them faster. And to get the last bit out of them, they implement this on custom hardware that can better exploit ternary operations: they create an FPGA accelerator. This is something I have no clue about, just to be said. But obviously current hardware is optimized for the things we currently want to do with it, and if you really want to get the most out of new ideas like these ternary operations, you do have to build custom hardware for them. But if you have that, you can make

### [48:20](https://www.youtube.com/watch?v=B45FlSQ8ITo&t=2900s) Conclusion

really, really big advances. Please look at the paper; it's quite well written, I have to say, and I do believe these sections, like their FPGA implementation and the experiments with it, are super interesting. I'm just not a good authority, or even a good source of information, to make sense of them. All right, so the conclusion is: they achieve performance on par with state-of-the-art Transformers while eliminating the need for MatMul operations. My opinion: I agree, this is very cool. It promises a future of at least having options for edge inference and much more hardware-efficient inference. I do remain a bit skeptical about the trade-offs they have to make to achieve that; I would personally think that going forward, classic Transformers will still have quite an edge in a lot of areas, and I'm also a bit skeptical about that crossover-point prediction for the reasons above. But all in all, a very cool paper. Thanks for listening, that's it, stay hydrated, and bye-bye.

---
*Source: https://ekstraktznaniy.ru/video/11930*