# How Language Models Choose the Next Word

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=vQbSBdJ1Irw
- **Date:** 2024-10-28
- **Duration:** 8:59
- **Views:** 3,192

## Description

Are Large Language Models (LLMs) just advanced versions of autocomplete? While some AI experts describe them as “next-word predictors,” this is an oversimplification. In this video, we’ll dive deep into how LLMs (like ChatGPT, Claude, and Gemini) actually choose the next word when generating text.

We’ll explore the difference between the modeling and decoding phases, and how decoding strategies—such as greedy search and beam search—impact the quality and creativity of a model’s output. 

Video sections:
00:00 Are LLMs just autocomplete?
00:20 Token selection algorithms
01:32 Modeling vs. Decoding
03:13 What is language model decoding?
04:12 The probability of text
05:41 Greedy Search
06:49 Beam Search
07:54 What is neural text degeneration?

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️  Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

🔑 Get your AssemblyAI API key here: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_marco_3

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#MachineLearning #DeepLearning #llms #algorithm

## Transcript

### [0:00](https://www.youtube.com/watch?v=vQbSBdJ1Irw) Are LLMs just autocomplete?

Large language models, or LLMs, completely dominate today's conversation around AI. Quite often, even some AI insiders refer to LLMs as massive autocompletion algorithms that simply output the next most likely word. But is this simplistic description accurate at all? This video is going to give you everything you need to know to understand how LLMs actually choose the next word. With the rise of

### [0:20](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=20s) Token selection algorithms

frontier language models like ChatGPT, Claude, and Gemini, developers have been focused on finding the optimal model among various API offerings and popular open-source choices such as Meta's Llama models. Much of this effort has been spent comparing model performance on platforms like the Chatbot Arena and examining benchmark results self-reported by the companies behind these models. However, one critical area has remained out of the limelight: decoding strategies, the algorithms that determine how language models generate text. Both industry and academic research have largely focused on other areas, such as prompt engineering. Although useful, prompt engineering techniques are developed on top of a bedrock of anecdotal findings and might even become obsolete with future model iterations; indeed, this might already be the case for the newest o1 model launched by OpenAI. On the other hand, deeper experimentation with token selection algorithms has been largely overlooked. These algorithms set the decision rules used to extract text strings from a model's probability estimates, and no matter how models evolve or training paradigms change, decoding strategies will remain a key component of language-model-based text generation. So, to set the ground for our discussion, let's recall some fundamentals of LLMs. At a high level, in natural language modeling we

### [1:32](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=92s) Modeling vs. Decoding

distinguish two separate phases: the modeling phase, how LLMs learn during training, and the decoding phase, how they generate text. During training, LLMs optimize an objective function to estimate the probability of the next word in a text sequence, and this process results in a statistical model of language. We can essentially summarize the breakthrough in LLM development of the last few years as the following empirical discovery: optimizing this simple objective function through massive scale, both in computation and in the amount of training data, effectively models human language. This partly explains why LLMs are sometimes described as next-word predictors, but this description can lead to some confusion about the inner workings of LLMs in the decoding phase. When generating text, LLMs can in fact employ a variety of algorithmic strategies that utilize their internal statistical model of language. In other words, a trained LLM provides a mathematical function that acts as a probability estimator within any specific text generation algorithm, that is, a decoding strategy. So the correct way to think about LLM decoding is rather the following: an LLM can be used to explore, or search, the space of all possible text sequences, in theory even at different levels of granularity (think words, sentences, and paragraphs, or, even more abstractly, differentiating between drafting, planning, and developing phases). The specific method used to search this space is called a decoding strategy, and the choice of a particular decoding strategy can impact a model's quality along different dimensions: from task-specific performance, where different strategies may be more suitable for creative tasks versus more predictable or structured outputs, to inference speed, which determines the computational cost per generated word for a model of a given size. Language

### [3:13](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=193s) What is language model decoding?

model decoding is the process by which an LLM generates text, and by text here we really mean the series of symbols, or tokens, that represent it. In fact, the Transformer, which is the underlying architecture of all current language models, is really a general tool that can input sequences of tokens and output sequences of tokens; whether these tokens are meant to represent words, image data, genetic sequences, audio, or other forms of coherent signals is a matter of design. For open-ended text generation, the goal of LLM decoding can be abstractly described as follows: given an input sequence of tokens S, we want to choose a continuation of n additional tokens that forms the completed sequence S′. This continuation should be contextually coherent as an extension of the given input. But how do we quantify contextual coherence for this continuation? We do so by using the probability function provided by a language model trained on human-generated text sequences. An LLM estimates the probability distribution over the vocabulary of tokens given any sequence S, so a language model can be

### [4:12](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=252s) The probability of text

pictured as a function that maps a token sequence to a vector of probabilities. What are the numbers, or entries, in this vector? For any token x in the model's vocabulary (the list of all possible tokens), we have the probability of x conditioned on the sequence S. This number represents the likelihood that the token x would follow the sequence S. With this, we can compute the probability of any completed sequence, any sentence if you want, based on the probability chain rule, where the probability of a sequence is decomposed into a product of probabilities of single tokens, each conditioned on the preceding sequence. Note that this approach to computing text probabilities really defines causal language modeling; other approaches exist, but this one is the dominant method for text generation. Now, with this foundation, we can describe the general schema of any decoding strategy in the context of open-ended text generation. Step one: given the current context sequence S, sample a token x from the model's distribution conditioned on the sequence S. Step two: update the context sequence to S′, that is, S followed by x. Then repeat steps one and two until a termination condition is met; typically this happens when the model predicts a special token, a symbol that represents the end of the sequence. In this general framework, the choice of method by which we sample x from the distribution in step one is precisely what distinguishes different strategies. So let's get more concrete now and see two basic examples of deterministic methods. Deterministic here means that the same input always generates the same output, so there is no randomness in the process. The simplest
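The chain rule and the generic decode loop described above can be sketched in a few lines of Python. This is a minimal toy, not a real LLM: the `next_token_probs` function and the four-token vocabulary are invented stand-ins for a Transformer's output distribution, chosen only so the example runs on its own.

```python
import math

# Toy stand-in for an LLM's probability function: maps a context
# (tuple of tokens) to a distribution over a tiny vocabulary.
# A real model would compute this with a Transformer forward pass.
VOCAB = ("the", "cat", "sat", "<eos>")

def next_token_probs(context):
    # Make "<eos>" more likely as the sequence grows so that
    # generation terminates; otherwise spread mass uniformly.
    p_eos = min(0.9, 0.25 * len(context))
    rest = (1.0 - p_eos) / (len(VOCAB) - 1)
    return {t: (p_eos if t == "<eos>" else rest) for t in VOCAB}

def sequence_log_prob(tokens):
    """Chain rule: log P(x1..xn) = sum_i log P(x_i | x_1..x_{i-1})."""
    return sum(math.log(next_token_probs(tokens[:i])[tok])
               for i, tok in enumerate(tokens))

def decode(context, select, max_steps=20):
    """Generic schema: step 1, pick a token with `select` from the
    model's distribution; step 2, append it to the context; repeat
    until the end-of-sequence token is produced."""
    seq = tuple(context)
    for _ in range(max_steps):
        tok = select(next_token_probs(seq))
        seq = seq + (tok,)
        if tok == "<eos>":
            break
    return seq
```

Note that `decode` takes the selection rule as a parameter: plugging in different `select` functions is exactly what distinguishes one decoding strategy from another.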

### [5:41](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=341s) Greedy Search

decoding strategy for language models is called greedy search. It is the most straightforward approach: at each step, choose the token x that is the most likely token given the context sequence S. Spoiler alert: despite what many people think, this is not how current conversational language models actually produce text. It can be slightly counterintuitive, but note that this strategy doesn't necessarily produce the most likely overall sequence; it just picks the most likely token at each individual step. Choosing a less probable token at one step could well lead to a more probable overall sequence in subsequent steps. To visualize this, imagine the process as exploring a probability tree: greedy search only explores a single branch of this tree, the one that seems most promising at each step. To find the actual most likely sequence, one would have to explore the full tree of all possible token combinations, which is computationally infeasible for any practical text length due to the size of the vocabulary, often on the order of 30,000 to 60,000 tokens. In fact, despite its computational efficiency, greedy search doesn't generally produce high-quality text, and it is empirically found to output generic and dull text in the setting of open-ended generation. Beam search is the
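A minimal greedy decoder looks like this. Again, the model is a hand-crafted toy lookup table standing in for a real LLM's distribution; the vocabulary and probabilities are invented purely for illustration.

```python
# Greedy search: at each step, take the single most probable next
# token. The "model" here is a hard-coded lookup table, not an LLM.
VOCAB = ("deep", "learning", "rocks", "<eos>")

def next_token_probs(context):
    table = {
        (): {"deep": 0.7, "learning": 0.1, "rocks": 0.1, "<eos>": 0.1},
        ("deep",): {"deep": 0.05, "learning": 0.8,
                    "rocks": 0.1, "<eos>": 0.05},
        ("deep", "learning"): {"deep": 0.05, "learning": 0.05,
                               "rocks": 0.5, "<eos>": 0.4},
    }
    # Unknown contexts default to ending the sequence.
    return table.get(tuple(context),
                     {t: (0.97 if t == "<eos>" else 0.01) for t in VOCAB})

def greedy_decode(context, max_steps=10):
    seq = tuple(context)
    for _ in range(max_steps):
        probs = next_token_probs(seq)
        tok = max(probs, key=probs.get)  # the greedy choice
        seq += (tok,)
        if tok == "<eos>":
            break
    return seq
```

Because each step is a deterministic argmax, running `greedy_decode` twice on the same context always yields the same sequence.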

### [6:49](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=409s) Beam Search

natural generalization of greedy search, offering a way to explore multiple branches of the probability tree. Instead of just picking the next token with the highest probability, it maintains a whole beam of the K most probable sequences at each time step, where K is referred to as the beam width. Depending on the beam size, this method produces higher-quality text than greedy search, but it can be slower due to the additional computation. More precisely, the algorithm works as follows: consider the top K most probable next tokens for each sequence in the current beam, where the initial beam consists of only the input sequence; then expand each sequence with these K tokens, creating K new candidate sequences for each existing sequence; finally, update the current beam by selecting the top K sequences among these candidates. By maintaining multiple candidate sequences, beam search can effectively look ahead in the probability tree, potentially finding higher-probability sequences that greedy search might miss. The beam width, the chosen parameter K, determines the trade-off between depth of exploration (larger K) and computational efficiency (smaller K). However, beam search often falls short in open-ended text generation tasks. Both
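The steps above can be sketched as follows. The toy model is rigged so that the greedy path is not the best one: the most probable first token ("A") leads to a mediocre continuation, while the second-best ("B") leads to a strong one, so a beam of width 2 recovers a higher-probability sequence than greedy search. All tokens and probabilities are invented for the example.

```python
import math

# Toy model where greedy is suboptimal: P(A)=0.6 but P(A,x)=0.3,
# while P(B)=0.4 and P(B,x)=0.36 > 0.3. Unknown contexts end.
def next_token_probs(context):
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"x": 0.5, "y": 0.5},
        ("B",): {"x": 0.9, "y": 0.1},
    }
    return table.get(tuple(context), {"<eos>": 1.0})

def beam_search(context, k=2, max_steps=10):
    # Each beam entry: (sequence, cumulative log-probability).
    beam = [(tuple(context), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beam:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished: pass through
                continue
            probs = next_token_probs(seq)
            # Expand with the top-k next tokens for this sequence.
            for tok in sorted(probs, key=probs.get, reverse=True)[:k]:
                candidates.append((seq + (tok,),
                                   score + math.log(probs[tok])))
        # Keep only the k highest-scoring candidates.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == "<eos>" for seq, _ in beam):
            break
    return beam[0][0]  # best sequence found
```

With `k=1` the beam degenerates to greedy search and returns the "A" branch; with `k=2` it looks one step further and finds the more probable "B" branch.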

### [7:54](https://www.youtube.com/watch?v=vQbSBdJ1Irw&t=474s) What is neural text degeneration?

qualitative studies involving human preferences and quantitative analyses comparing the variance in human-generated text versus beam search outputs have revealed patterns of degeneration in beam search outputs. In fact, while beam search sequences generally have high likelihood scores, they lack the natural variance and diversity characteristic of human-written text. This can manifest as frequent repetitions of phrases as well as a tendency to use a less varied vocabulary: two examples of so-called neural text degeneration, as this is referred to in the literature. So the natural question is: how can we introduce more variance and unpredictability into the generated text while still maintaining overall coherence and high likelihood? It turns out that an essential step is to introduce some degree of randomness into the decoding process. This leads us to so-called stochastic methods for text generation, which is a topic I covered in another video. So that's all from me today; I hope you enjoyed the content, and see you in the next video. It may seem a little counterintuitive, but after model training, it turns out that in text generation, or decoding, randomness is essential to allow language models to search the infinite space of possible text sequences in an effective way.

---
*Source: https://ekstraktznaniy.ru/video/12565*