# TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=gfU5y7qCxF0
- **Date:** 23.11.2024
- **Duration:** 28:22
- **Views:** 19,618

## Description

A deep dive into the TokenFormer and an opinion about its impact, novelty, and relation to prior work.

Paper: https://arxiv.org/abs/2410.23168

Abstract:
Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at \url{this https URL}.

Authors: Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

## Contents

### [0:00](https://www.youtube.com/watch?v=gfU5y7qCxF0) Intro

Hello there. Yeah, it's cold here, but we cannot be dissuaded from reviewing papers. Today we're going to look at "TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters", which seems to be a collaboration of the Max Planck Institute for Informatics, Google, and Peking University. This paper proposes, as the title says, the TokenFormer, a modification of the Transformer architecture that, as they say, treats model parameters as tokens and therefore introduces a new axis of flexibility in Transformer scaling. Ultimately that results in an architecture where you can add parameters to an already trained model and then just continue training it at that bigger parameter count. In my opinion, there's maybe 5% of an idea here, and 95% is smoke and mirrors, couching things that have already existed for a long time in modern words; fundamentally there's nothing new. I don't want to be too harsh from the outset, although I probably just was, but we'll dive into the paper.

First, what are they attempting to do? They say: look, the substantial cost of scaling Transformers remains a significant concern, and Transformers depend on a fixed number of parameters, specifically within their linear projections, so when you try to make any modifications to a Transformer, that typically requires retraining the entire model from scratch. So they introduce this TokenFormer, which leverages attention not only for computations among input tokens, like a classic Transformer, but also for interactions between tokens and model parameters.

Previously you had two types of computation in a Transformer. You would chunk up your text into tokens, and then you had the attention mechanism, where essentially you do token interaction: tokens paying attention to other tokens, and that giving rise to the next layer, the next representation of tokens. And then you had the feed-forward networks, which take each token separately and push it through to produce its next representation. Both of these contain model parameters, and both of these interactions are done by linear projections. In the feed-forward layer, if token x goes in, the feed-forward network has parameters W, and you get an output x' with x' = Wx, a linear projection. In the attention mechanism you do have token interactions, but before that, attention typically consists of some kind of outer product of queries and keys, which goes through a softmax operation and is multiplied by the values; that gives you the attention matrix and, from it, the output of the next layer for all the tokens together. But to come up with the Q's, the K's, and the V's, you also compute W_Q x, W_K x, and W_V x, so even to facilitate the attention you still have interactions between inputs and model parameters.

Their whole point is: what if I want to add to these parameters? Well, then my whole dimensionality changes, I cannot just use the same model again; I now have a bigger model, my whole internals are bigger, and therefore I need to retrain everything from scratch. And maybe that's not so good; maybe we don't want to retrain things from scratch. So their goal is to replace these input-parameter interactions with another attention mechanism. If you know anything about attention, you know that in principle we could just add tokens, if it weren't for the position embeddings. So assuming we have no position embeddings, we could just add tokens and the exact same mechanism, the exact same Transformer, could process them, no problem; the Transformer itself isn't dependent on the length of the input sequence. They extend that to the parameter space and say: well, if we use attention right here, we wouldn't necessarily be dependent on the size of this W matrix, and therefore we could just increase it, and that allows us to add more parameters to the model. Okay, so that's
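The idea described above, input tokens acting as queries against learnable parameter tokens acting as keys and values, can be sketched in a few lines. This is a minimal NumPy sketch using the standard softmax, not the authors' implementation (the paper actually uses a modified normalization); the function name `pattention` and all shapes here are illustrative:

```python
import numpy as np

def pattention(x, key_params, value_params):
    """Token-parameter attention sketch.
    x: (seq, d) input tokens, used as queries.
    key_params, value_params: (n, d) learnable parameter tokens."""
    scores = x @ key_params.T / np.sqrt(x.shape[-1])        # (seq, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)               # softmax over the n parameter tokens
    return weights @ value_params                           # (seq, d), same shape as x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# The number of parameter tokens n is a free axis: the output shape
# stays (4, 8) no matter how many key/value tokens we supply.
for n in (16, 32, 64):
    out = pattention(x, rng.normal(size=(n, 8)), rng.normal(size=(n, 8)))
    assert out.shape == (4, 8)
```

The loop is the point the video makes: n can grow without touching the input or output dimensions, just as the sequence length can grow in ordinary attention.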

### [5:45](https://www.youtube.com/watch?v=gfU5y7qCxF0&t=345s) Trainable Parameters

essentially it. They end up with experiments like this one, saying: look, if I want to train a 124-million-parameter Transformer from scratch, it's going to require a certain training cost and I'm going to reach a certain perplexity. If I then want a bigger one and have to start from scratch, that requires substantially more training than if I can just scale up from my previous size to this one using their technique, wasting almost no computation to get to that next size, and so on.

As I already said, they replace the linear interactions in two particular places in the Transformer. Here you can see a classic Transformer, with its QKV projection, which is how you obtain the queries, keys, and values for the attention; that is done via a linear projection. What they do is replace this with an attention mechanism itself, so an attention mechanism, not a linear projection, gives rise to these quantities. And the feed-forward network also becomes an attention mechanism. The trainable parameters are what they call key params and value params. In both cases these are trainable parameters, but they're not directly multiplied with the signal; they are supplied to the attention mechanism, and they are tokens. The key params internally consist of a set of n tokens, and the value params likewise. The input is used as queries into these keys, and that defines an attention matrix that aggregates the values. It essentially means that for each query you get an output that is a weighted sum of the values, weighted by how well the query matches the key; that's the standard definition of an attention mechanism.

And at no point does the attention mechanism require that you have as many queries as you have keys and values; only the number of keys and the number of values need to match. So you have the freedom of having as many key and value tokens as you want, as long as you have the same amount of each. You can see that this is just a function: it takes the input signal and the trainable parameters and gives you an output the size of the input signal. You take the queries (the input signal) times the keys (learnable parameters), apply a softmax, and multiply by the values (also learnable parameters), and that's your next-layer representation. You do this instead of a feed-forward network W x.

The same goes for obtaining the Q, the K, and the V of the actual attention layer: you do one of these operations for each, which gets a bit meta and a bit confusing. To obtain the K of the attention layer, you do an attention mechanism of the input signal with the key and value parameters dedicated to K: input X times the key parameters for K, softmax, aggregate the value parameters for K. So you have separate parameters for coming up with K, separate parameters for V, separate parameters for Q, just as if you had linear projections K = W_K x with separate W's for the keys, the queries, and the values. So I hope you can picture something along those lines.

And you might think: wow, that's a neat idea, because I can now just add parameters here, especially if I zero-initialize them. Only one of the two has to be zero-initialized, for example the keys, or let's say the values. Then you can aggregate as much as you want; the new values will never actually do anything, until training starts changing them. So you can add parameters on the fly and not change anything about the model, and that's pretty good. Now, why do I have a bit of my
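The zero-initialization trick is exact when the score function maps a zero score to a zero weight instead of renormalizing across all slots (which is why the paper swaps the standard softmax for a modified one). A hedged sketch of the growth step, with a plain ReLU standing in for whatever normalization the paper actually uses; all names and shapes are illustrative:

```python
import numpy as np

def pattention_relu(x, K, V):
    # Non-normalizing score function: Theta(0) = 0, so a zero-initialized
    # slot contributes exactly nothing. A standard softmax would instead
    # renormalize and perturb the old outputs.
    return np.maximum(x @ K.T, 0.0) @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
out_small = pattention_relu(x, K, V)

# Grow the layer on the fly: append zero-initialized key tokens.
# The matching new value tokens can be anything, since their weights are 0.
K_big = np.concatenate([K, np.zeros((8, 8))])
V_big = np.concatenate([V, rng.normal(size=(8, 8))])
out_big = pattention_relu(x, K_big, V_big)

# The grown model computes exactly the same function, until training
# starts moving the new keys away from zero.
assert np.allclose(out_small, out_big)
```

The same check fails if you replace `pattention_relu` with a softmax-normalized version, which is the detail hidden behind "initializing the keys with zero, similar to LoRA".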

### [11:40](https://www.youtube.com/watch?v=gfU5y7qCxF0&t=700s) Diagram

gripes with this work? So, they go through everything here; you can see a diagram of essentially what I've just shown, except that, contrary to before, the diagram goes top-down. This part replaces the traditional attention mechanism, and this part the feed-forward mechanism. And here you can see that this essentially acts as queries and keys, and this as values: you have matching attention scores using a softmax and then a weighted sum of the values, which gives you the output. If you have old parameter tokens, you can add new ones, and they say somewhere here how they initialize them with zeros: "we augment this set by appending new key-value parameter tokens", so you concatenate old and new key tokens, old and new value tokens, and "this scaling scheme permits the integration of an arbitrary number of parameters without altering the input or output dimensions. By initializing the keys with zero, similar to LoRA, our model can perfectly resume the model state from the pre-training phase without losing the well-learned knowledge, facilitating easier convergence and accelerating the overall scaling process."

All right, so I have essentially two problems with this paper. One is: even if you take it at face value, that this is a new thing, novel, different, and so on, if you look closely at the curves I've shown you, it's kind of odd. First of all, they say: what we did is we trained one from scratch and then one incrementally. The one from scratch, if I recall correctly (or this could be the one further down), is, I think, the more elaborate curve.

### [14:00](https://www.youtube.com/watch?v=gfU5y7qCxF0&t=840s) Comparison

The one from scratch here is trained with 300 billion tokens, and where it says 15B, 30B, and 60B for the other ones, what's actually happening is that the first one is also trained with 300 billion tokens, and then you add an additional 15, 30, or 60 billion tokens to get the respective curves. So in order for this to make sense and give you a benefit, you assume the comparison is someone training a classic Transformer for the full duration at every size. If you actually want to go stepwise through all the sizes, this is a benefit, because each step starts from an already trained smaller version.

Now, what I find weird, first of all, is that even at the smallest level the baselines differ: their method already outperforms the classically trained one for some reason, even though both are initially trained with 300 billion tokens. That already feels like a lot of what a lot of papers do: they find good hyperparameters for their model, for whatever they're doing, and then they say "we just use those hyperparameters everywhere", which I think I've read in this paper as well, and which obviously means good settings for you, not necessarily for the baseline. So they already start further down. But then, what's interesting is that for the same size, for example these sizes right here, the from-scratch one is better. Yes, it costs more to get there, but it's better. And if you consider someone who skips the lowest size and just says "I want to train a Transformer with 354 million parameters", they're going to train it through 300 billion tokens, while these models here have presumably gone through 300 billion plus, in the yellow case, 60 billion tokens, so through more tokens, and they're worse. That's what I find suspect. Also here: this one went through 300 billion tokens, whereas this one presumably went through 300 billion at this size plus at least 60 billion to get here, and it's still worse. It goes through more tokens and it's still worse; that's a bit suspect to me.

Then, I don't know what this training cost here actually represents. I'm just going to assume it represents training this particular model from scratch, and not training the whole sequence of these models from scratch. But I can see that once you're bigger your cost obviously goes up, and that's what makes their cost lower. So the only thing they can say is: look, we have a lower total cost of getting there; but then they end up at a worse place. Which is fine, but I just wanted to point out that even these scaled models start at the number of tokens that the from-scratch models have as their final state.

The second thing, and that's a bit more crucial in my opinion, is their framing of this entire thing: of "we're going to introduce new parameters, and the

### [18:25](https://www.youtube.com/watch?v=gfU5y7qCxF0&t=1105s) Traditional Transformer

parameters are tokens" and whatnot. If you look at the traditional Transformer, let's just look at the feed-forward part. I previously said you have a token, you push it through, and it gives you a new one, x', via a set of weights W. That's actually not the whole story. Even the very first "Attention Is All You Need" Transformer actually had a token going through an up-projection, call it W1, and then a down-projection, call it W2, which gives you x', and in the middle there's some sort of nonlinearity. So the actual computation was x' = W2 applied to a nonlinearity (ReLU or something) of W1 x. You can see that here, too, we have a free parameter, call it m (since they're calling theirs n): we can make this inner dimension as large or as small as we want, and the rest of the architecture isn't affected at all, which is one of the things they say is suddenly possible with their architecture because they can add tokens.

The second thing: I can recover their formulation from this. If I just call W1 "K-tilde" and call W2 "V-tilde", you'll see that what I'm doing is multiplying x by K-tilde (transposed or not, who cares, it's a linear operation), applying some nonlinearity, and then applying V-tilde, and all of a sudden it looks a whole lot like what they had. And the same applies if I want to add parameters: depending on whether you consider row or column multiplication, I can just add a bunch of zero columns to K-tilde, and as long as I add the corresponding rows to V-tilde (they don't even need to be zero), it will be exactly the same. I fill in my smaller K, pad with zeros, and get the exact same result while increasing my parameter count. So this, too, has long been possible.

And I would argue it's actually an inferior way of doing it. People who have done this in the past, myself included, found it's probably better not to pad with zeros but to use some sort of actual up-projection if you want to take a smaller trained model and transfer it to a bigger one, something like an orthogonal projection; those have really nice mathematical properties, whereas filling up with zeros gives you a scaling issue in some sense. So you might want to try that instead. But nevertheless, the zero-padding has been done, and even experiments changing this inner dimension have been done: if you look at the original Transformer paper, they have ablations varying this inner dimension while keeping the rest of the architecture completely the same.

Neither the original Transformer nor this paper manages to actually change the dimensionality of the tokens themselves, because to do that you can't just add zeros to something; you would actually change the forward-propagated signal. Well, actually, there is a way: if you look at what they do in the attention part, you could add appropriate zeros, so that if you go from an x to a Q, and this is your original vector, you just append a bunch of zeros, and if you do that consistently, your inner products with the K's null out in the new dimensions and you get the same inner products, modulo some scaling factor. So even there you can add zeros.

So if you write it side by side, the only new thing this paper does is this: it multiplies x by their key parameters, aggregates the value parameters, and uses a scaled softmax in between instead of a ReLU; and it uses the same mechanism inside the attention as well, instead of just a linear projection. But for that, too, you could just say: instead of obtaining our Q values by W x, we obtain them by W2 applied to a nonlinearity of W1 x, a slightly more powerful representation for coming up with these intermediate values. And that's essentially it. The rest of this paper is just couching this change in nonlinearity in the language of tokens, token-parameter interactions, token-token interactions, scaling flexibility, and whatnot. So yeah, that is a bit, I don't
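The argument above, that the classic feed-forward layer already has a freely growable inner dimension via zero-padding, is easy to check numerically. A minimal sketch, not from the paper; shapes and names are made up for illustration:

```python
import numpy as np

def ffn(x, W1, W2):
    # Classic Transformer feed-forward block: up-project, ReLU, down-project.
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))   # inner dimension m = 16
W2 = rng.normal(size=(16, 8))
out = ffn(x, W1, W2)

# Widen the inner dimension from 16 to 24 by zero-padding:
# new columns of W1 are zero, so the new hidden units output ReLU(0) = 0,
# and the matching new rows of W2 can be anything.
W1_big = np.concatenate([W1, np.zeros((8, 8))], axis=1)
W2_big = np.concatenate([W2, rng.normal(size=(8, 8))], axis=0)

# The widened network computes exactly the same function.
assert np.allclose(out, ffn(x, W1_big, W2_big))
```

Renaming W1 to K-tilde and W2 to V-tilde, this is structurally the same input-times-keys, nonlinearity, aggregate-values computation as the token-parameter attention sketch, with a ReLU where TokenFormer puts a scaled softmax.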

### [24:40](https://www.youtube.com/watch?v=gfU5y7qCxF0&t=1480s) Conclusion

know. I feel like even that would be fine, but they don't mention it. If you look at the formula, they never say: hey, look, even the original Transformer paper essentially did what we're doing here, but we have a new way of looking at it. That would also be fine. But the only place where they compare to anything classic is this Net2Net way of adding parameters to a neural network; they compare to that, are somewhat better, and that's it. Now, there is a slight chance that even the authors thought they had something super new, because once you start thinking "oh, we're going to replace it with a token attention mechanism" and whatnot, you tend to be driven into this world by yourself. But who knows.

So, all in all, looking at the technique by itself, I actually think it's quite okay. It's probably a good way of messing with one part of the model size. There is a certain flexibility here where you can add parameters, but it's adding parameters in a distinct way that influences one part of the model: you cannot change all of the architecture, and you cannot add representational capacity to the forward signal itself. What you can essentially do is add computational capacity, the complexity of transforming a signal from one layer into the next; that you can change with this method. You cannot change the fundamental carrying capacity of the forward signal. So it's a way to modify one aspect of the Transformer model that has been present, at least in the feed-forward layer, since the very first Transformer paper. If you want to be generous, you can say this paper extends that also to the computation of the attention inputs, and that's a cool thing, and you might want to look at things this way in order to understand them better and extend your freedom of experimentation. On the other hand, as I said, a lot of this is just word-couching, and if you actually write down what it does, you'll find that it's essentially a different nonlinearity, and that's it.

All right, this was my look through this paper, and I hope that wasn't too harsh. Again, I do think the research in itself is pretty good and is a good way of thinking about things. If you have different opinions, please let me know, and I'll see you next time. Bye-bye.

---
*Source: https://ekstraktznaniy.ru/video/11888*