# Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)

## Metadata

- **Channel:** Yannic Kilcher
- **YouTube:** https://www.youtube.com/watch?v=loaTGpqfctI
- **Date:** 24.12.2024
- **Duration:** 36:14
- **Views:** 47,392
- **Source:** https://ekstraktznaniy.ru/video/11871

## Description

#tokenization #llm #meta

This paper does away with tokenization and creates an LLM architecture that operates on dynamically sized "patches" instead of tokens. By controlling the patch size, they gain a level of control over the tradeoff between model size and FLOPs and use that to achieve more favorable scaling behavior than classically tokenized LLMs.

Paper: https://ai.meta.com/research/publications/byte-latent-transformer-patches-scale-better-than-tokens/
Code: https://github.com/facebookresearch/blt

Abstract:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented dynamically based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.

## Transcript

### Segment 1 (00:00 - 05:00)

Hello there. Today we're looking at the paper "Byte Latent Transformer: Patches Scale Better Than Tokens". This paper, in a sense, does away with classic fixed-vocabulary tokenization and, in doing so, develops a new architecture called the Byte Latent Transformer. In their experiments they show that this, as the paper says, scales better than a classic model operating on classically tokenized tokens. So what they do is they do away with tokenization and find a different, dynamic way of splitting text into pieces, which they call patches. Patches are like tokens, except they need a different name so it's clear which one you're talking about. Once you run a model on those, you get better scaling properties, and that's the central claim of this paper: if you compare models based on byte pair encoding, the classic tokenization used in the Llama models, with a model that operates on patches, you get better scaling behavior. In their plot, the red and orange lines are the patch-based models and the blue line is the classically tokenized one.

Now, there are a lot of choices that go into that graphic. The y-axis is what they call bits-per-byte. If you don't deal with tokens, and especially if you compare different tokenizations, you can't really use perplexity as a measure, because that requires the models to operate over the same fundamental pieces. Bits-per-byte is the analogous measure to perplexity, so think of it as a tokenizer-independent perplexity. The x-axis, and this is important, is total training FLOPs: they always consider FLOP-matched models. Their model operates differently: it has an outer layer and an inner layer, and you don't need to execute the inner layer for each outer step, so the outer part runs more often and the inner part less often. That's why you can afford bigger models with the patch-based approach: your patches are bigger, so you run the inner model less often, and if you invest the same amount of training FLOPs you can come out ahead. So there are a lot of ifs in "it scales better": what they keep constant is the training FLOPs, and for the same training budget, beyond a certain threshold, the patch-based models do better simply because they have better scaling properties.

Why exactly that is, is probably down to a mixture of the parts of their architecture, so let's dive into it. Here is the Byte Latent Transformer. As I said, it's a two-tier system. The inner system is your very regular LLM-type transformer; there's absolutely nothing special about it, except that it operates on these pieces. Usually these would be tokens, so these would be token embeddings, and the model would predict the next token. If you view a transformer that way, a token and its history go in, and out comes a distribution over next tokens, some softmax probability distribution. If you take one step back, though, and consider what happens in the last layer: out of the model comes an embedding, the hidden signal at the second-to-last layer, and the last layer is a matrix of dimension hidden size by vocabulary size.
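As a quick aside on that bits-per-byte metric on the y-axis: it normalizes the model's negative log-likelihood by the number of raw bytes in the text rather than by tokens, so models with different tokenizers (or no tokenizer at all) become comparable. A minimal sketch of the conversion, with illustrative function and variable names:

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Tokenizer-independent analogue of perplexity: total negative
    log-likelihood of a text (in nats), divided by ln(2) to convert
    nats to bits, then normalized by the text's length in raw bytes."""
    return total_nll_nats / (math.log(2) * n_bytes)

# A model that assigns a text of 5000 bytes a total NLL of 3466 nats:
print(bits_per_byte(3466.0, 5000))  # ~1.0 bits per byte
```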

### Segment 2 (05:00 - 10:00)

So this vocab-sized matrix is what actually does the classification: multiplying those two things together and then applying a softmax gives you the distribution over next tokens. In a sense, you could argue that what the transformer actually does, even a regular transformer, is predict the embedding of the next token. And there's one more thing, if you do what's called weight tying or embedding tying: tokens come in, and there's your embedding table, so for each token you have an embedding in there. Some models tie that input table and this output matrix together, meaning they use the same parameters, which saves a lot of parameters, and it fits the same idea: the embedding table maps from token IDs to embedding space, and this matrix maps from embedding space back to token IDs via the probability distribution. In that sense it's even more true that any language-model transformer that outputs logits at the end effectively just predicts the embedding of the next token.
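A minimal sketch of what that weight tying looks like in practice; this is a common PyTorch idiom, not code from the paper, and the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class TiedHead(nn.Module):
    """Input embedding and output projection share one weight matrix."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: same parameters

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (..., d_model) -> logits over the vocabulary
        return self.lm_head(hidden)

head = TiedHead(vocab_size=32000, d_model=512)
logits = head(torch.randn(1, 512))
probs = logits.softmax(dim=-1)  # distribution over next tokens
```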
That being said, on the inner side there is just a very regular autoregressive transformer LLM that takes things in and predicts the next thing, all in embedding space. Now, usually these are tokens, so maybe it's worth briefly recapping how we get to those tokens. Say we have a piece of text, for example "data is primarily determined by the number". We want to split it up into individual pieces that our models can operate over. One method would be to make every single character, including the whitespace, its own piece. But that's not really the best option, because it results in very long sequences for a given text, and transformers scale quadratically with sequence length, which doesn't make us happy: our context window of 128,000 tokens would just be 128,000 characters.

Can we do better? Well, yes, we could split by whitespace, for example: "data" becomes one token, "is" becomes one token, "primarily" becomes one token, and so on. This was the standard for a very long time. And whatever you do, if you operate with tokens, you're going to have a table that maps your tokens to embeddings, as we said before: every token needs a corresponding embedding vector that you can look up. The word "data" has to have an embedding vector in there somewhere, the word "is", and so on. You can already see the problem: this table is going to become really big. The even bigger problem is that you have to derive this table somehow. You take a training corpus, look at all the words in there, and that's how you initialize the table. But English is such a big language that your test data will very likely contain a word you've never seen in training: a name, maybe a number, or just a rare word; for example, you might actually never have seen the word "determined" before. Some people tried to mitigate this with things like stemming: instead of "determined" you just say "determine", declare the two to be essentially the same word, and keep a single entry in the embedding table instead of one each for "determine", "determined", "determining", "determinization" and whatnot, so it's all just one token. But the problem of out-of-vocabulary words, as people used to call it, remained really big and problematic, and so people came up with alternatives.

### Segment 3 (10:00 - 15:00)

Those alternatives are what is currently very popular. So what are they? If you look at things like byte pair encoding or WordPiece encoding, they all follow the same principle. They say: there exists a set of unitary things, and those unitary things can be used to make up all of the text that we see. In WordPiece, those unitary things would be all the characters that exist: a, b, c, d and so on up to z, then capital A, then the digits, then the question mark and so on. That's still a lot, but it's not infinite: with a decent set of single symbols you can represent any sequence of characters. You might say, aren't we now back to the same problem, where character level isn't really good? Yes, so let's say we just do lowercase ASCII, a to z. We can represent everything, but we also know that the combination "er" is very frequent in the language, so let's assign "er" its own slot. We still have "e" in there somewhere and "r" in there somewhere, but whenever we encounter "er" we choose to represent it with its own token and its own embedding. Then you go on: maybe "am" is very common, and "ur", and then you start making bigger combinations, like "dad", which is a very common string in the language, and so on. There are heuristic ways of deriving these merges; it's essentially a compression algorithm, if you will, and you assign individual tokens to the merged units. It's not just whole words: these things are more like word pieces that you build up. The same goes for byte pair encoding, where you just operate in the realm of bytes. You can encode any text into a series of bytes via encoding standards like UTF-8, which is a very common one, and then you literally know what all the symbols are: 0 to 255, those are all the single bytes that can exist. Then you start combining the bytes that appear often in your text, which is a bit cleaner than working with characters and symbols. So those are your choices: that would be byte pair encoding, and this would be more like WordPiece.
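To make that merging procedure concrete, here's a toy sketch of BPE training. It starts from characters rather than raw bytes and ignores the efficiency tricks real tokenizers use, so treat it as an illustration of the merge loop, not as any production tokenizer:

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Repeatedly merge the most frequent adjacent pair of symbols."""
    words = [list(w) for w in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        for w in words:  # apply the merge everywhere, left to right
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i : i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

print(train_bpe(["determine", "determined", "determining"], 5))
# first merge is ('i', 'n'), the most frequent adjacent pair in this tiny corpus
```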
Now, this seems good, but it has its own set of problems, several of which stem from tokenization itself. For example, numbers: if you have the number 2568, that might actually get tokenized as the tokens "256" and "8", because 256 is a very common number and then you just add the "8"; the tokenizer goes for the minimum number of tokens. That's a problem if you want to teach the neural network to multiply, because it will not see 2, 5, 6, 8; it will see some token with ID 89 and then 71, and it has no clue these are made up of digits. There are a bunch of other problems with tokenization too. What this paper also shows is that tokenization results in fairly small chunks of text, where you could go for bigger chunks. But if you keep it all in a table, bigger chunks of text mean more possible combinations, so your storage explodes. That's why they ask: do we even need this table? Maybe we don't. Maybe we can get away with a table just for the individual pieces, the unitary things, and come up with a scheme for how we recombine those

### Segment 4 (15:00 - 20:00)

things in a learned way. Can we teach a neural network to take the embeddings of the individual constituents and come up with the embedding for higher-order combinations? That would allow us to not even have a fixed set of higher-order combinations, but rather arbitrary combinations, and the neural network would just produce an embedding for them on the fly. Those could then be the individual pieces we feed into the bigger LLM. So it's not a character-level or byte-level LLM that we're doing; it's a two-stage process, where a first stage produces, out of the byte embeddings, what they call a patch embedding, a thing six to eight characters long, and that then gets fed into the LLM.

Now, you'll realize, as I said at the beginning, this idea could totally be done with the tokenization we have today: you could tokenize exactly how we tokenize right now, but not have the big embedding table, and instead do this two-stage process where the first stage builds your token embedding from the character embeddings that make up the token, and the second stage is your normal LLM that operates on token embeddings. However, because they have this method, they also say: well, we don't need fixed-vocabulary tokenization anymore. A fixed vocabulary is derived once, because you need that table, and then you tokenize all text into it. You don't have out-of-vocabulary problems anymore, because the individual characters are in there, so you can tokenize anything, but it's still fixed. So they say: hey, with this process we can do dynamic tokenization, and that's what they call patching.

Again, from the inside to the outside: on the inside we have an LLM that operates on what they call patch embeddings, which are essentially just token embeddings, except the tokens aren't fixed; they are dynamic groupings, patches, of characters, or rather of bytes in our case (sorry, all non-ASCII people). Once we know where the patch boundaries are, and in their figure the text at the bottom gets divided into four patches, we can use this local encoder to look at the bytes in a patch and give us a single patch embedding that we then feed to the transformer. The local encoder is a model that's trained to do exactly that; as far as I can tell, it's trained end to end together with the latent transformer. And the local decoder takes a patch embedding and decodes it into its constituent bytes.
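To make the local encoder's job concrete, here's a deliberately naive stand-in. The real local encoder in BLT is a small transformer with cross-attention; the interface, though, a variable-length run of bytes in, one patch embedding out, can be sketched with simple mean-pooling. All names here are illustrative:

```python
import torch
import torch.nn as nn

class NaivePatchEncoder(nn.Module):
    """Stand-in for BLT's local encoder: bytes of one patch -> one patch embedding.
    Mean-pooling is just for illustration; the paper uses a small transformer."""
    def __init__(self, d_byte: int, d_patch: int):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_byte)  # fixed table, one row per byte
        self.proj = nn.Linear(d_byte, d_patch)

    def forward(self, patch_bytes: torch.Tensor) -> torch.Tensor:
        # patch_bytes: (patch_len,) byte IDs -> (d_patch,) patch embedding
        return self.proj(self.byte_embed(patch_bytes).mean(dim=0))

enc = NaivePatchEncoder(d_byte=64, d_patch=512)
patch = torch.tensor(list(b"determ"))  # a 6-byte patch
print(enc(patch).shape)  # torch.Size([512])
```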
You can see that the local encoder and the local decoder run more often than the latent transformer, and now you have a degree of freedom: the bigger you make these patches, the wider they become, the more bytes there are per patch on average, and the more often you run the local encoder compared to running the chunky latent transformer. So you can make the inner model bigger. You gain a lot of FLOPs because you run the inner part less often as you make the patches larger, and as long as the outer parts are lightweight, they don't matter much; you can get away with a bigger model because you spend fewer FLOPs by running it less often. Now, some astute observers might have realized: hey, this local decoder, when does it

### Segment 5 (20:00 - 25:00)

know when to stop? It just gets one thing and is supposed to produce bytes from it; we'll get to that in just a bit. The second question is obviously: how do we know where the patch boundaries are? How do you know how to group the bytes into patches? The answer to these two questions is kind of the same, and it's what they call entropy-based grouping of bytes into patches.

What they essentially do is train a small transformer, a byte-level transformer, and notably this is not the latent transformer: it's a separate, small LLM that operates directly on bytes, a character-level LLM trained on a corpus, and it decides where to split, in the following way. You feed text into it, and it predicts the next byte. If the entropy of that prediction, the distribution over the next byte, is very high (high entropy being a distribution where it could be any of many options, low entropy being "it's definitely this one"), meaning the model is not sure, that's where you split. So if the predicted distribution for the next byte is above an entropy threshold, that's where you make a split. That's just a decision they make, a design choice, but there's good reason to split by entropy: you keep the stuff together where you're sure. Whenever the continuation is very clear, say after "bett" the "er" is almost certain, you want to keep it together, because it effectively is one unit: whenever you're very sure what comes next, you can argue the whole thing should be treated as a single unit. When you're not sure, there could be multiple continuations, and that's when you want to split, because in an alternative universe there's a different continuation you need to take into account, and you're better off if that first part is the same patch each time, rather than the entire thing being a different patch every time, telling you nothing.

And this is also the answer to how the local decoder stops decoding. It decodes and decodes, and it always asks this small LLM: what's the entropy of the next byte, in your estimation? This small model knows nothing of the latent transformer; it just looks at the stuff being produced, and if the next byte has high entropy according to it, that's where the patch ends.

Okay, so the process is as follows. We have some text, and say we're at a new patch boundary. We run the small LLM forward, byte by byte, until the entropy goes above the threshold: that's where we say, okay, that's a patch, our patch goes from here to here. Then the local encoder looks at the bytes in there; there's an embedding table from byte to embedding, and notably you only need 256 entries, fixed, this doesn't grow. It looks up the embeddings of the constituents and aggregates them into a patch embedding; bing, it's trained to do that. Then you run the latent transformer for one step.
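Here's a sketch of that boundary rule, the simple global-threshold variant; `next_byte_probs` stands in for the small byte-level LM, and the threshold value is a hyperparameter. (The paper also discusses a variant based on how the entropy changes from one byte to the next; this sketch only shows plain thresholding.)

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-byte distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def patch_boundaries(byte_seq: bytes, next_byte_probs, threshold: float) -> list[int]:
    """Start a new patch wherever the small byte-level LM is uncertain.
    `next_byte_probs(prefix)` is assumed to return the LM's distribution
    (256 probabilities) over the byte that follows `prefix`."""
    boundaries = [0]
    for i in range(1, len(byte_seq)):
        if entropy(next_byte_probs(byte_seq[:i])) > threshold:
            boundaries.append(i)  # high uncertainty -> split before byte i
    return boundaries

# With a toy "LM" that is only unsure after a space, every word starts a patch:
toy_lm = lambda prefix: ([1.0 / 256] * 256 if prefix.endswith(b" ")
                         else [1.0] + [0.0] * 255)
print(patch_boundaries(b"data is primarily", toy_lm, threshold=1.0))  # [0, 5, 8]
```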

### Segment 6 (25:00 - 30:00)

So the latent transformer runs for one step (let's assume the continuation doesn't exist yet) and produces the next latent output token. The local decoder takes this and starts producing bytes; it decodes like an LLM, except conditioned on this global signal. It produces one byte, then the next, and each time it asks the small LLM what it thinks about the next byte in the sequence decoded so far. As soon as the small LLM says "wait, the entropy is quite high", it stops: okay, the patch ends here, go back and start the next cycle of the process. At least that's how I think it goes; maybe I'm totally wrong, but that's what I can read out, as the paper is a bit sparse on these exact details. And I haven't read the code, I have to apologize for that, but the code is available, so you can go verify or refute this.

There is one extra bit of information you need here. Usually, when you do autoregressive decoding, you take what you've produced and feed it back into your own model. That doesn't work here, because the local decoder doesn't take text, doesn't take bytes, as an input; it just takes this global signal as input. What does take bytes as input? The local encoder. So there's a hidden skip connection: when the local decoder produces a byte (again, this is my understanding), you run that byte through the local encoder to get its local encoder embedding, but you don't go up to the latent transformer, because you're not done with the patch yet; you just feed it back into the local decoder, which then has a latent representation it can decode the next byte from. That loop, local decoder to local encoder to local decoder, is the outer loop that runs to produce the bytes, and once you're done with a patch, you take the patch you've just produced, embed it, feed it into the latent transformer, get the next global signal from that, and run the outer loop again to produce the individual bytes, until the small LM again says the patch is over. That's how I personally understand it.

Here in the figure we have exactly that, the encoder and the decoder. The encoder gets byte embeddings and uses cross attention: it knows these should be grouped into three different patches, so it uses cross attention from each patch query to only the bytes that are part of that patch. By the way, there are two query rows per patch here not because there are more patches, but because this represents two-headed multi-head attention, with keys into the byte states. You still have hidden states across many layers, and these hidden states are what you give to the decoder, which does the exact opposite: its queries are the individual bytes that you produce, and its keys and values are the global signal that you get from the latent transformer.
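As a shape-level illustration of that decoder-side cross attention (queries from the byte stream, keys and values from the patch outputs), here's PyTorch's stock attention module rather than anything from the BLT codebase; the dimensions are made up:

```python
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

byte_states  = torch.randn(1, 12, d_model)  # queries: 12 bytes decoded so far
patch_states = torch.randn(1, 3, d_model)   # keys/values: 3 latent patch outputs

out, _ = cross_attn(query=byte_states, key=patch_states, value=patch_states)
print(out.shape)  # torch.Size([1, 12, 512]): one updated state per byte
```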
All right, there is one more thing. Now, I'm going to guess that this part, the encoder hash n-gram embeddings, was added because it just works better; it seems very much like a thing you add after the fact.

### Segment 7 (30:00 - 35:00)

They say: look, we model each byte individually, so when we do encoding, each byte gets encoded by itself, but also as part of byte n-grams. You can see that they build up not just the byte-to-embedding table, but several embedding tables: there is an embedding table for byte 3-grams, one for byte 4-grams, for byte 5-grams and so on, up to byte 8-grams. Now you ask: aren't the byte 8-gram tables huge? Isn't that exactly what we tried to avoid? Yes, they would be, and that's why you just hash the n-gram and take it modulo the size of the embedding table. You're essentially counting on the fact that yes, there are going to be hash collisions, some of the byte 3-grams are going to hit the same embedding, but the colliding things are kind of orthogonal in meaning, so it's probably fine. I'm going to guess it's just a way to get n-grams in there. So when you look at a byte, for example the letter "t", you also take the embeddings of the 3-gram, 4-gram, 5-gram, 6-gram, 7-gram and 8-gram ending at that byte, and you aggregate all of these together into the byte embedding, so to say. The local encoder thus doesn't operate purely on the byte embedding, as I said before; it actually operates on a superposition of byte n-gram embeddings, which puts each byte into the context of the bytes before it. To me this seems like a way to sneak tokenization back in, that's what it tells me, except instead of tokens it's n-grams. Make of that what you will.
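A sketch of that hashed n-gram lookup; the table size, the use of Python's built-in `hash`, and all the names are ours, as the paper only specifies the general scheme (hash the n-gram, take it modulo a fixed table size, accept collisions):

```python
import torch
import torch.nn as nn

class HashedNGramEmbeddings(nn.Module):
    """Per-byte embedding enriched with hashed n-gram embeddings.
    Each n-gram table has a fixed size, so it never grows with the
    number of distinct n-grams; collisions are simply accepted."""
    def __init__(self, d_model: int, table_size: int = 50021,
                 ngram_sizes: tuple[int, ...] = (3, 4, 5, 6, 7, 8)):
        super().__init__()
        self.byte_table = nn.Embedding(256, d_model)  # one row per byte, fixed
        self.ngram_sizes = ngram_sizes
        self.table_size = table_size
        self.ngram_tables = nn.ModuleList(
            nn.Embedding(table_size, d_model) for _ in ngram_sizes
        )

    def forward(self, byte_ids: list[int], pos: int) -> torch.Tensor:
        """Embedding for the byte at `pos`, summed with the embeddings
        of every n-gram that ends at `pos`."""
        emb = self.byte_table(torch.tensor(byte_ids[pos]))
        for n, table in zip(self.ngram_sizes, self.ngram_tables):
            if pos + 1 >= n:
                ngram = tuple(byte_ids[pos + 1 - n : pos + 1])
                idx = hash(ngram) % self.table_size  # collisions accepted
                emb = emb + table(torch.tensor(idx))
        return emb

embs = HashedNGramEmbeddings(d_model=64)
text = list(b"determined")
print(embs(text, pos=9).shape)  # torch.Size([64]): the final "d" in context
```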
I don't want to talk too much more; I think that's about it for the model design and how they decode. When they experiment around, they find they can actually make patches larger than regular tokens: they can match the performance of the Llama 2 and Llama 3 setups while using significantly larger patches. Where the Llama 2 and Llama 3 byte pair encodings have an average token size of 3.7 and 4.4 bytes, they achieve similar scaling trends with an average patch size of 6 and even 8 bytes. So you have that handle on the tradeoff, which is pretty cool, I have to say. They do experiments showing they remain competitive with these Llama-style models, but they're also a lot better at tasks where you actually need to look at the individual characters inside a token: since they operate on byte embeddings, they can handle very fine-grained tasks, whereas if you just have fixed tokens and look up their embeddings in a table, that doesn't work as well. It's cheesing a bit, demonstrating "look, we do really well on spelling and inverting strings compared to the Llama models", which was to be expected, but it's nice that they actually perform the experiment to show it. What's also interesting is that translation works better for languages that are underrepresented, or that get tokenized differently from your standard languages, and that's also pretty cool.

All right, I don't want to dive in much more here. Please look at the rest of the paper: it's pretty interesting and pretty thorough, the experiments are cool, and they pay a lot of attention to controlling for various parameters, because it's really hard, if your model operates on different fundamental units, to even compare it to other models,

### Segment 8 (35:00 - 36:00)

and they do a good job at that. There is room for improvement in several places: notably, you could train more things jointly, for example that small language model that does the patching. And as of now, in terms of raw runtime, this still lags behind, because obviously we've spent about a decade hyper-optimizing fixed-tokenization autoregressive LLMs with things like FlexAttention, which they name here, and that work would still need to be done for these patch-level models to actually get their runtime there. That's why, when they compare, they match FLOPs, which is probably a pretty good measure that's independent of raw optimization. All right, that's it. As I said, read the paper, subscribe for more reviews, and thank you so much. If you're watching this as it comes out, then a holly jolly Christmas and a happy New Year. See you around, bye-bye.
