# Mixtral of Experts (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=mwO6v4BlgZQ
- **Date:** 13.01.2024
- **Duration:** 34:31
- **Views:** 65,527

## Description

#mixtral #mistral #chatgpt 

OUTLINE:
0:00 - Introduction
3:00 - Mixture of Experts
6:00 - Classic Transformer Blocks
11:15 - Expert Routing
17:00 - Sparse Expert Routing
22:00 - Expert Parallelism
25:00 - Experimental Results
31:30 - Routing Analysis
33:20 - Conclusion

Paper: https://arxiv.org/abs/2401.04088

Abstract:
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

## Contents

### [0:00](https://www.youtube.com/watch?v=mwO6v4BlgZQ) Introduction

Hello, today we're going to look at Mixtral of Experts. The model has been out for a while, and there have been blog posts and so on, but this is the paper about the Mixtral 8x7B mixture-of-experts model, built on the Mistral 7B architecture and obviously released by Mistral AI. You can see we're going back to the good old days of WordArt; I like it, but it does take some getting used to.

The paper goes by a different name in my head, the nickname I gave it, which is "Don't Say Data", and you'll see that the entire paper goes by without ever giving you even the slightest hint about where the training data came from. Now this, I believe, is a smart choice, because the fashion du jour of the professional complainer crowd that's around is to complain about where the training data comes from, and lawsuits are being filed about copyright on training data and so on. It used to be different things: first all the biases of the models, then the crowd workers; the trend in vogue with the professional complainers now is to complain about where the data comes from. So Mistral doesn't say anything, which I guess is smart, but it's also a little bit weird in a research paper that is supposed to tell the research community how you did things and how they can reproduce them.

If you don't know, Mistral AI is a new startup, I believe out of France, and their approach so far is the most open-source approach of all the AI startups out there, including, to some degree, Stability AI, which prides itself on being very open source. The Mistral models are released under the Apache license, which means: do whatever you want. The flip side, as I already said, is that they don't tell you where the data comes from, whereas Stability AI does tell you, at least to some degree, where the data comes from and how they obtained it, yet their models are released under these, let's say, stupid licenses that have usage restrictions. In any case, there aren't too many surprises in this paper; they say, well, this is a transformer with a mixture-of-experts architecture, so I think it's an opportunity to look at what that actually means. I've made videos in the past on mixture of experts, expert routing and so on, but we can dive into that here again and just see how it's done.

### [3:00](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=180s) Mixture of Experts

Mixtral 8x7B is a sparse mixture-of-experts model with open weights under Apache 2.0, and it outperforms Llama 2 70B and GPT-3.5 on most benchmarks. What's interesting is the total parameter count of the model; there's a reason they don't just say it's 56 billion parameters. I don't actually think it's 56 billion, it's probably less; we'll get to that. The total parameter count is less than, for example, Llama 2 70B. For GPT-3.5 we don't know: GPT-3 was 175 billion, and it's quite conceivable 3.5 was a distilled version of that, but it's safe to assume this Mixtral model has fewer total parameters than those models, so it's quite cool to see that it outperforms them on various benchmarks.

The other thing, and why they write it like this, is the mixture of experts, this expert routing, which essentially means that for every token the model only uses a subset of its parameters. As a forward signal goes through the network, not every part of the network is even activated for that signal, so the actual parameter count used per token is even lower, which allows for some optimizations: you can either be faster or achieve higher throughput. They say: faster inference at low batch sizes and higher throughput at large batch sizes. It's a decoder-only model, and the feed-forward block picks from a set of eight distinct groups of parameters; we'll look briefly at what that means. There's a 32,000-token context window, which is on par with other current transformer-based large language models; it is huge, a big context window. And here is our first hint about where the training data comes from: it is pretrained with "multilingual data". That's it. They could literally take, I don't know, The Pile, which is multilingual data, or take Shakespeare and just add one German phrase to it; we don't know, and they don't say. So, what is a mixture-of-experts model? Usually in transformers we talk a lot about the attention mechanism.

### [6:00](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=360s) Classic Transformer Blocks

Attention is usually treated as the core ingredient of these transformer models, but in fact transformer models are essentially this: you have your input tokens, every token gets turned into a vector by an embedding layer, and at the very top you have an output layer, which is just an inverted embedding layer (call it embedding inverse). So you have output vectors, and they'll be turned into tokens again, or if you do next-token prediction you just have one at the very end that's predicted into one of 32,000 tokens or so. In the middle you have the transformer blocks, repeated N times. Every transformer block (this has evolved a bit through the years) usually stacks two different core layers on top of each other: the attention layer and the feed-forward layer, also called the feed-forward network.

As I said, when we talk about transformers we usually talk a lot about the attention layer. What happens there is that, given a sequence of input tokens, the attention layer can pass information around: you transform this signal into the next signal, so you get the same number of vectors again, and the attention mechanism lets you pass information between any of them. The next layer, the feed-forward network, is different: it applies to every single token, every single output vector, by itself. If you have a sentence with different tokens, and these are their intermediate representations, every single one goes through this network individually. It's a function applied independently to each token: a token's vector goes through the feed-forward network, which could be, for example, a dense layer followed by some nonlinearity followed by another dense layer, and produces the next representation; the next token goes through the same network and produces its representation, and so on.

A lot of the parameters are actually here. All the talk about attention is important because that's how individual tokens exchange information and computation becomes context-dependent; otherwise it's literally just a bag-of-words model. However, while attention is very interesting when you talk about context length and memory requirements for forward passes, a ton of parameters sit in these feed-forward networks, because they take in a token's vector and multiply it by a giant weight matrix. The hidden dimension, the dimensionality of the embeddings, is about 4,000 (I think they detail it somewhere), and the inner feed-forward dimension, if I see this correctly, is about 14,000. That means every single token is a vector of roughly 4,000 dimensions, which gets blown up from 4K to 14K, then there's a nonlinearity, then another matrix going from 14K back down to 4K, and so on. 14,000 times 4,000 is already something like 50 million parameters in a single one of these matrices, so these are big, chunky things. It also means, since there is only one such feed-forward network per block (well, two matrices, but one network), that you have to treat every single token the same: every token goes through this same weight matrix, and therefore the transformation is the same for every single token.
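To make the "applied individually to each token" point concrete, here is a minimal NumPy sketch of such a per-token feed-forward block. The dimensions are scaled-down stand-ins for the roughly 4K/14K sizes mentioned above, and ReLU stands in for whatever activation the real model uses; none of this is Mixtral's actual code.

```python
import numpy as np

d_model, d_ff = 64, 224                      # scaled-down stand-ins for ~4,000 and ~14,000
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d_model, d_ff))    # up-projection, the "blow up" matrix
W2 = rng.normal(0, 0.02, (d_ff, d_model))    # down-projection back to d_model

def ffn(x):
    """Classic per-token feed-forward: dense -> nonlinearity -> dense."""
    return np.maximum(x @ W1, 0.0) @ W2      # ReLU stands in for the real activation

tokens = rng.normal(size=(5, d_model))       # a "sentence" of 5 token vectors
out = np.stack([ffn(t) for t in tokens])     # each token goes through on its own
assert np.allclose(out, ffn(tokens))         # batching is equivalent: no token mixing
```

The final assertion is the whole point: unlike attention, the feed-forward layer never mixes information across tokens, so running tokens one by one or as a batch gives identical results.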

### [11:15](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=675s) Expert Routing

That is what mixture of experts aims to change. Mixture of experts essentially says: wait a second, what if we had not just one of these weight matrices but multiple ones? So these are the W1s, then we have the W2s, W3s, and so on. We can make the individual ones smaller if we want to retain the same total parameter count, but the main idea is that if I now have a vector, I can push it through each of these functions individually and combine their outputs at the end. Why would I do that? If you just distribute, compute, and gather again, there might already be some advantage; notably we know that from multi-head attention, where the attention layer is also split up like this: every attention layer has multiple attention heads which collect different signals and are then gathered together again. So there could already be advantages just from that.

However, this paper goes further: they do a sparse mixture of experts. We'll call these computation paths experts, so expert 1, expert 2, expert 3. Instead of sending the token to every expert, we only send it to a subset. How about we only send it to one expert? So this token goes here, but the green token for some reason goes there, and two tokens can go different ways. What decides where each token goes is another small, so-called routing neural network, which they call G: a routing network decides to which expert a particular token should be routed. In fact, they always select two experts, and I believe we've talked about this in papers like GShard (if I remember the name right): there are stability reasons why you would want to route the same token to at least two experts. If you want, you can go watch that video on sparse mixture-of-experts routing.

In any case, G is just another neural network. Remember what we're doing: we take the intermediate vector representation of a single token and decide to which expert, out of a fixed set of experts, this token goes. That's a classic classification problem: I have my feature vector x, I push it through some function, and that gives me n logits, one per expert. I can take the highest one or two and route my token there, and since these are logits (say this one is big, this one is small), after I softmax them I get a distribution, or at least a weighting, and I can see: ah, a lot of this expert and a little bit of that one. So I send the token to these two experts, gather their signals, and combine the outputs according to these weightings; I do a weighted sum at the end, depending on how the routing function told me to route this particular token. For the next token you do the exact same thing, and the router will give different outputs, different weightings of experts. So not only is the signal different that passes through this layer, but the path the computation takes is different for each signal, and the routing itself is determined by the signal: the same x that is routed through the computation is also used to decide where it should be routed in the first place. It's a bit like distributing people to different jobs: you look at a person and say, well, you're very tall, you go do the thing where you need to reach high shelves, and that person goes there and does that job. The person itself is used to decide where the person goes, and then the person does something different from what the other people do; I don't have a better way to explain it. So G is a small neural network that decides where stuff goes. Ultimately, in mathy terms, if x is the intermediate representation of a given token in a given layer, we put it through the routing network and through the expert networks, and at the end we build a weighted sum according to what the router said and what the experts output.
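The "routing is just classification" view can be sketched in a few lines. This is a hypothetical router, not the paper's code: a single linear layer produces one logit per expert, we keep the top two, and we softmax only over those to get the combination weights.

```python
import numpy as np

n_experts, d_model = 8, 64
rng = np.random.default_rng(1)
Wg = rng.normal(0, 0.02, (d_model, n_experts))   # the router's single linear layer

def route(x, k=2):
    logits = x @ Wg                  # one score (logit) per expert
    top = np.argsort(logits)[-k:]    # indices of the k highest-scoring experts
    w = np.exp(logits[top])          # softmax, but only over the selected k
    return top, w / w.sum()

x = rng.normal(size=d_model)
experts, weights = route(x)
# The token's layer output would then be weights[0]*E_a(x) + weights[1]*E_b(x),
# where E_a, E_b are the two selected expert feed-forward networks.
```

A different token vector x would generally produce different logits, hence a different pair of experts and different weights, which is exactly the "the signal decides its own path" property described above.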

### [17:00](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=1020s) Sparse Expert Routing

Now, obviously, the trick is that this routing function outputs something sparse: if you build it with this top-k in there, the output will be zero for most experts, so you don't even have to compute most of them. That's the whole trick; that's how you save on computation, make the active parameter count per token much smaller, and have different tokens use different experts. Notably, I don't see any entropy regularization or similar that would guarantee different tokens get routed to different experts; I guess it's just not necessary, though I do believe some of the initial mixture-of-experts papers had such terms in their loss. If it's not necessary, that's fine too.

The other thing: they say E_i denotes the output of the i-th expert, so i indexes the experts, which run up to n, and then "G(x)_i denotes the n-dimensional output of the gating network for the i-th expert". This, I believe, is a mistake in the paper (maybe not, but think about it): each expert outputs a vector, you sum those vectors, and you want another vector, so the gating factors must be scalars (or matrices); they cannot be n-dimensional. The output of the gating network for the i-th expert cannot be n-dimensional. I believe what they meant is that G(x) has an n-dimensional output and G(x)_i is the i-th entry of that output, which is just what I said before: the classification layer where you take only the top k entries (in their case the top two), set everything else to zero, and normalize with a softmax. And they say the gating network is just a linear feed-forward layer, which fits: if x is a vector and you multiply it by a matrix, you get a vector, hopefully n-dimensional. So, a small mistake; I was quite confused when I read it, but I do believe the paper is a bit wrong there.

They also make a distinction here: since we're doing this routing with a sparse element, we should distinguish the model's total parameter count, the so-called sparse parameter count, which grows with n; obviously, as you add experts you add parameters, because each computation path has its own weight matrices. Still, don't be confused: there is still one function, call it FF, and we apply it to each token individually, which means every single token goes through the same function; that hasn't changed. What has changed is that internally, inside that function, the input x might take a different computation path than the input y and activate different parameters. Each token is still pushed separately through the feed-forward stage; it's just that inside that stage we have sparse elements, and depending on the token, the signal gets routed differently. The active parameter count, then, is the number of parameters used for processing an individual token, and it grows with k: the more experts you consider per token, the more work you do.
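Putting the pieces together, here is my reading of the paper's formula as a runnable sketch: y = sum over the top-k experts of softmax(top-k(x · Wg))_i · E_i(x), with only the selected experts ever evaluated. The sizes, the ReLU experts, and the gating details are illustrative assumptions, not Mixtral's real configuration; the sketch also shows the sparse-vs-active parameter count distinction.

```python
import numpy as np

n_experts, k, d_model, d_ff = 8, 2, 32, 112
rng = np.random.default_rng(2)
Wg = rng.normal(0, 0.02, (d_model, n_experts))            # gating network G: one linear layer
experts = [(rng.normal(0, 0.02, (d_model, d_ff)),         # each expert owns its own
            rng.normal(0, 0.02, (d_ff, d_model)))         # pair of FFN matrices
           for _ in range(n_experts)]

def moe_ffn(x):
    logits = x @ Wg
    top = np.argsort(logits)[-k:]            # top-k: all other gate entries are zero
    w = np.exp(logits[top]); w /= w.sum()    # softmax over the surviving entries only
    y = np.zeros(d_model)
    for wi, i in zip(w, top):                # only k of the n experts are ever computed
        W1, W2 = experts[i]
        y += wi * (np.maximum(x @ W1, 0.0) @ W2)
    return y

per_expert = 2 * d_model * d_ff              # params in one expert's two matrices
print("sparse (total) expert params:", n_experts * per_expert)   # grows with n
print("active expert params:       ", k * per_expert)            # grows with k
```

Note how the gate weights are scalars multiplying each expert's output vector, which is the reading of G(x)_i argued for above.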

### [22:00](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=1320s) Expert Parallelism

So the trick here is to use only two of the eight experts for each token, which immediately divides the number of active parameters per token inside these feed-forward layers by four. They also note this can be used for expert parallelism. If you're doing really high-throughput inference, you put each expert (this is W1, this is W2, this is W3; there might be two matrices and some nonlinearities in each) onto a different GPU: this one on GPU 1, this one on GPU 2, and so on. To each GPU the computation then looks like a dense operation, and it's just the router that decides to which GPU a token is sent. This obviously only works if you have high throughput, if you pipeline your signals with maybe a bit of queueing and coordination, or run very high batch sizes. But if you have that throughput, you can shard the model like this, one expert per GPU, and then for each GPU individually it no longer looks like a sparse operation; it looks like a dense operation, and GPUs are very good at dense operations, so you can massively speed up throughput. If these were regular feed-forward layers and you just did model parallelism, that would also be a dense operation per GPU, but every GPU would have to treat every token; essentially every token goes everywhere if you're not doing a sparse mixture of experts. With sparse routing, each token only visits the GPUs holding its two selected experts, so each shard only processes a fraction of the tokens instead of all of them, and that's how you increase throughput with sparsity.

And this formula right here, that's the magic all of machine learning comes down to: it's essentially just the new feed-forward layer. There is routing, we take the top two, we softmax over the top two, and that's our weighting for aggregating the outputs of what are otherwise the classic feed-forward networks of a
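The expert-parallel trick described above can be sketched as a grouping step: tokens are bucketed by their assigned expert, and each "device" (here just a slot in a list) then runs one dense batched matmul over its bucket. This is a toy illustration with top-1 routing and a random assignment standing in for the router; real systems add an all-to-all communication step and top-2 routing.

```python
import numpy as np

n_experts, d_model, n_tokens = 4, 16, 100
rng = np.random.default_rng(3)
W = [rng.normal(0, 0.1, (d_model, d_model)) for _ in range(n_experts)]  # one expert per "GPU"
tokens = rng.normal(size=(n_tokens, d_model))
assign = rng.integers(0, n_experts, size=n_tokens)   # stand-in for the router (top-1 here)

out = np.empty_like(tokens)
for e in range(n_experts):                 # in real expert parallelism these loop bodies
    idx = np.where(assign == e)[0]         # run concurrently, one per device
    out[idx] = tokens[idx] @ W[e]          # each device sees one dense batched matmul
```

The scatter before the loop and the gather back into `out` correspond to the communication that the router's decisions induce between devices; each device itself only ever sees a dense operation.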

### [25:00](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=1500s) Experimental Results

classic transformer. And that's it.

Results: I don't want to go too much into them; experimental results are obviously interesting if you actually want to compare different things, and in this case they keep on par with or outperform these other models, such as the Llama 2 70-billion-parameter model and also ChatGPT, i.e. GPT-3.5. It depends a bit, though; with these plots, always be careful. Here they plot against active parameters, comparing Mistral 7B and their mixture of experts, and the red one is Llama 2. It's a tiny bit deceptive. It is correct: the active parameters per token determine how fast you can do a forward pass at inference time. However, they're not always the same active parameters; the increased performance comes from the fact that it is dynamically decided which parameters belong to the active set for each token. The plots are still very good, obviously, but keep that in mind when comparing to a model like Llama 2, which does no selection of active parameters: all its parameters are always active.

Reasoning, retrieval, yada yada: they show the model passes the passkey task, where you have to retrieve something from the context window depending on where it sits, and they show it can retrieve it from everywhere. I do find it a good benchmark to compare against others who haven't passed it, but honestly, if everything is green, the benchmark ceases to be useful; make it harder, show me where it breaks. They also show that across the context length they train with, perplexity decreases on a test set, which essentially means the model can still make use of the full context. What would be bad is if this were to go up again, which would mean that adding stuff to the context, even though you technically can, just makes things noisier; that doesn't happen, so that's good to see. But it's also fairly obvious from this plot that the biggest gain comes from the first few things you add to the context, and that's noteworthy: smartly selecting what you put into the context is still a much better strategy than just throwing in everything you have. For this particular task it doesn't hurt to put a lot more into the context, but I'd argue from experience that very often it will hurt; so keep that in mind and still try to select smartly what goes into the context.

Bias benchmarks: who cares; maybe the European Union, I guess; they're about three years behind everyone else's collective consciousness, so they'll probably make some regulations according to bias. And they say they do supervised fine-tuning: after pre-training on "multilingual data" they do supervised fine-tuning on "an instruction dataset", wow, followed by direct preference optimization on, hold your horses, "a paired feedback dataset". Yep, I can totally go out and reproduce this; so informative. I guess it's a double-edged sword. We're all super happy that they do this, that they release the models under Apache; I think it's not just open source, it's a huge service to the community, because they released this under Apache and nothing has happened. The world is still here, it's not burning, people aren't smashing each other's heads in significantly more than if this model had been released under some stupid OpenRAIL license. I think that's the biggest service to the community: showing that you can release something fully freely and nothing bad happens, or at least nothing worse than had you not released it, or released it under some dumb license. I'm super duper thankful for that; it's extremely commendable. However, you can also definitely see where they think their business value, or their business risk, comes from (I'm not sure which): they don't say the slightest thing about the dataset this was trained on. Maybe it's just to troll the complainers; they could be cool enough for that. Maybe it's just The Pile, super obvious, but they don't say it so that everyone in the complainer crowd goes wild, and after six months they'll be like: oh yeah, it's just The Pile. Alright, this next bit, the routing analysis, is pretty cool.

### [31:30](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=1890s) Routing Analysis

They do an analysis of the routings: given that different tokens are routed to different experts, can you determine whether the experts specialize in different, kind of obvious, sub-fields? They say: "surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic". The only big thing they notice is that consecutive tokens are often assigned to the same experts, plus some regularities, such as: in Python code, all the leading whitespace tokens are routed to the same experts. But beyond that, there doesn't seem to be a super obvious pattern. Now this could mean two things. It could mean there aren't really semantic patterns, that it's about different aspects of the tokens that matter for routing; or, second, that the patterns are so far over our heads in terms of understanding that we just can't make sense of them. Both are absolutely possible; it's probably a mixture of both. Usually, when something moves away from the narrow slice of human semantics, humans have a hard time understanding it, even if it's a pattern on a similar level of obviousness as "every integer variable is routed here and every assignment character is routed there". Maybe eventually we'll find regularities, maybe we won't; that's still open. But it's cool

### [33:20](https://www.youtube.com/watch?v=mwO6v4BlgZQ&t=2000s) Conclusion

that they've investigated it, and they have some more analysis in the appendix, in case you're interested. And that was pretty much it for this paper. As I said, I think it's super cool that they just do this and release it under Apache, which means a lot of people can build things, build new cool applications, and so on. Not saying where the data is from is, I believe, not very scientific, but ultimately probably smart, either because of incoming regulation or because it's actually their value as a business; I don't know, but it's probably a smart idea. Let me know what you think, and what the best applications of Mixtral are so far. It's really cool, and I'm excited to see how the world moves forward with more and more open-source AI. That's it for me; thank you for listening, I'll see you around. Bye-bye.

---
*Source: https://ekstraktznaniy.ru/video/12152*