# What is Speculative Sampling? | Boosting LLM inference speed

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=2ouqE9g6oeM
- **Date:** 20.11.2024
- **Duration:** 6:17
- **Views:** 3,890

## Description

Speculative sampling is a decoding strategy that yields 2-3x speedups in LLM inference by generating multiple tokens per model pass and, most importantly, without changes to the final output. Learn what speculative sampling is and how it works in this video explainer.

References: 
[1] Create Your Own AI Agent (tutorial) https://youtu.be/Q7KhrSbEnSQ
[2] Typical sampling (video) https://youtu.be/a-6hVvU1WMk?t=423
[3] Google Research (paper) https://arxiv.org/pdf/2211.17192
[4] DeepMind (paper) https://arxiv.org/pdf/2302.01318
[5] What is rejection sampling https://en.wikipedia.org/wiki/Rejection_sampling


Video sections:
00:00 Speeding up LLM inference
01:27 What is speculative sampling
02:37 How speculative sampling works
03:52 Inference speed analysis
05:14 Preserving output quality

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬

🖥️ Website: https://www.assemblyai.com
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️  Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers

🔑 Get your AssemblyAI API key here: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_marco_4

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

#MachineLearning #DeepLearning #chatgpt

## Contents

### [0:00](https://www.youtube.com/watch?v=2ouqE9g6oeM) Speeding up LLM inference

As language models grow more powerful, the challenge of speeding up inference, that is, generating text faster, becomes essential, especially when it comes to scaling up AI applications like workflow automation and agentic AI systems. In previous videos we've covered some of the most common LLM decoding strategies, focusing primarily on improving output quality, but another critical factor comes into play: inference speed. In this video we're diving into speculative sampling, a cool recent technique developed by Google Research and DeepMind that offers 2-3x speedups in LLM text generation with no change in output quality. We'll explore how it works, why it's effective, and how speculative sampling is reshaping LLM deployment by making models faster, and therefore smarter.

What do I mean by that? Well, beyond the obvious benefits of reducing the costs and energy footprint of current usage, improving LLM efficiency at inference is largely motivated by the following widespread idea: being able to scale up the compute used for LLM inference while maintaining fixed costs could unlock applications and capabilities possibly still out of reach right now in the size regime of the most capable models. A concrete example is OpenAI's latest model, o1, where the underlying LLM generates thinking tokens behind the scenes, but there are lots of other examples in the literature around agentic AI systems. And by the way, if you're looking for a step-by-step tutorial for creating your own LLM-based AI agent, check the first linked video in the description.

### [1:27](https://www.youtube.com/watch?v=2ouqE9g6oeM&t=87s) What is speculative sampling

As you might know, all current LLMs rely on the Transformer architecture. By design, the time to generate a single token with one model call, or forward pass, is proportional to the Transformer's size, in terms of both number of model parameters and memory requirements. This means that, as a general rule of thumb, a model twice as big will require double the time to generate a sequence of the same length.

Speculative sampling is a decoding strategy recently discovered independently by teams at Google Research and DeepMind that yields 2-3x speedups by generating multiple tokens per model pass and, most importantly, without changes to the final output. The idea is to redistribute the model calls needed to generate a sequence between two models working in tandem instead of just one. Normally we need one call per generated token; to put it another way, the exact same amount of compute, one call, is distributed equally between all tokens in a sequence. However, this counters the intuition that some tokens, or words, should be easier to predict than others. The easy tokens, so to speak, should also be successfully predicted by a much smaller, and therefore cheaper, model. But the obvious question is how to preserve the quality of the original model while at the same time not failing at predicting the harder tokens.
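For reference, the one-call-per-token baseline described above can be sketched as follows. Here `model` is a stand-in for any callable mapping a token sequence to a next-token probability distribution; the names and interface are illustrative, not taken from the papers:

```python
import random

def generate(model, prefix, n_tokens, rng=random.Random(0)):
    """Baseline autoregressive decoding: one full forward pass
    of the (large) model per generated token."""
    seq = list(prefix)
    for _ in range(n_tokens):
        probs = model(seq)  # one forward pass -> next-token distribution
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq
```

Generating `n` tokens therefore costs `n` passes of the large model, which is exactly the cost speculative sampling sets out to reduce.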

### [2:37](https://www.youtube.com/watch?v=2ouqE9g6oeM&t=157s) How speculative sampling works

Let's see how speculative sampling comes as a really smart solution to this problem. Effectively, it redesigns the token generation loop and splits it into three separate steps.

In step one, a draft model, the smaller, faster model, generates a fixed number of k tokens starting from the given context sequence. For the sake of illustration, let's stick to k = 5, which turns out to be the optimal value in the experiments. This requires five passes from this model. Note that on top of the five selected tokens, the corresponding five next-token probability distributions from the model are also cached in memory.

The second step is verification. Here a target model, the original larger model, is used as a verifier: with a single forward pass, it is evaluated on the whole sequence, that is, the prefix sequence plus the five draft tokens. This yields six next-token distributions, which are cached in memory: one for the prefix sequence and one for each draft token position. Note that this step has required only a single forward pass from the large model so far.

The third and final step is correction. Following a technique called rejection sampling, the draft tokens are sequentially approved or rejected based on the ratio of probabilities from the target and draft models. If a token is rejected, the process stops and the next token is sampled from an adjusted distribution derived from the target model's prediction.
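The three steps above can be sketched in code. This is a minimal illustration, not the papers' implementation: `draft_model` and `target_model` are hypothetical callables returning next-token probability distributions as lists of floats, and the acceptance rule min(1, p/q) with the max(0, p − q) resampling distribution follows the rejection-sampling scheme described in both papers:

```python
import random

def speculative_step(prefix, draft_model, target_model, k=5,
                     rng=random.Random(0)):
    """One iteration of speculative sampling (sketch)."""
    # Step 1: draft. The small model proposes k tokens autoregressively,
    # caching its distribution q_i at every drafted position.
    seq = list(prefix)
    draft_tokens, draft_probs = [], []
    for _ in range(k):
        q = draft_model(seq)
        t = rng.choices(range(len(q)), weights=q)[0]
        draft_tokens.append(t)
        draft_probs.append(q)
        seq.append(t)

    # Step 2: verify. A single pass of the large model over
    # prefix + drafts yields k+1 target distributions p_0 .. p_k.
    target_probs = target_model(prefix, draft_tokens)

    # Step 3: correct. Accept draft token i with probability
    # min(1, p_i[t] / q_i[t]); at the first rejection, resample from
    # the adjusted distribution max(0, p - q), renormalized, and stop.
    out = []
    for i, t in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)
            continue
        adjusted = [max(pj - qj, 0.0) for pj, qj in zip(p, q)]
        out.append(rng.choices(range(len(adjusted)), weights=adjusted)[0])
        return out
    # All k drafts accepted: one bonus token from the last target
    # distribution, giving k+1 tokens from a single large-model pass.
    out.append(rng.choices(range(len(target_probs[k])),
                           weights=target_probs[k])[0])
    return out
```

Each call returns between 1 and k+1 tokens while costing exactly one target-model pass, which is where the speedup comes from.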

### [3:52](https://www.youtube.com/watch?v=2ouqE9g6oeM&t=232s) Inference speed analysis

And now the question is: why does using two models instead of one actually improve speed? The key is in the verification step, and it relies on a crucial feature of Transformer models: the attention mechanism computes and caches the attention weights for all tokens simultaneously by using matrix operations on the entire sequence, enabling the model to process each token position concurrently within a single pass. This is what allows us to get all the next-token distributions at once in the verification step by the large model.

Now you may ask: why do we do this cumbersome three-step process instead of just sampling each token directly from the target model? To understand why this works, let's analyze the best- and worst-case scenarios at each iteration. In the best case, all five draft tokens are accepted as valid by the target model, generating five tokens per target model pass. The computational cost of the draft model is considered negligible compared to the target model, so this gives roughly a 5x speedup, or slightly less, over the baseline. In the worst case, even if the very first token is rejected, one token can still be generated using the target model's distribution, because, remember, this was the output of the verification step. The cost is now essentially the same as the baseline: one target model pass per generated token. The general expected case will always be somewhere in the middle, between a 1x and 5x improvement, and this is where the 2-3x average speedup in the reported experiments comes from.
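The expected case can be made concrete with a simplifying assumption (mine, not a claim from the video): suppose each draft token is accepted independently with probability `a`. Rejecting at position i still yields i+1 tokens (the resampled one), and accepting all k yields k+1 (the bonus token), so the expectation collapses to a geometric sum:

```python
def expected_tokens_per_pass(a, k):
    """Expected tokens generated per target-model pass, assuming each
    draft token is accepted independently with probability a.
    E = sum_{i=0}^{k-1} a^i (1-a) (i+1) + a^k (k+1)
      = (1 - a**(k+1)) / (1 - a)
    """
    if a >= 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)
```

With k = 5 and an acceptance rate around 0.7, this gives roughly 2.9 tokens per target-model pass, consistent with the 2-3x speedups reported once the (small but nonzero) draft-model cost is accounted for.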

### [5:14](https://www.youtube.com/watch?v=2ouqE9g6oeM&t=314s) Preserving output quality

And that sounds great for inference speed, but what about the quality of the outputs compared to the baseline, the target model? Back to the correction step: note what the algorithm does. The target model will approve tokens only if its own predictions match those of the draft model, and will reject the generated tokens otherwise. Intuitively, this ensures that the final generated sequence aligns with the distribution of the target model, even though we are not sampling from this distribution directly.

Note also that this technique works in combination with any other decoding strategy, since we are free to choose the one used by the draft model. This could be temperature sampling, or typical sampling, for example, the decoding strategy based on information theory that we discussed in a previous video; check the links in the description.

Speculative sampling is a very cool technique to boost LLM inference speed, and if you want to learn the details, I will also drop a link to both research papers below. I hope you enjoyed this, and if you're looking into building your own voice-mode-type application to interact in real time with an LLM of your choice, you should take a look at Smitha's tutorial, where she does this using Llama 3 and AssemblyAI's API. See you in the next one!
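The claim that the accept/resample rule preserves the target distribution can be checked empirically with a small Monte Carlo experiment. The distributions `p` (target) and `q` (draft) below are made up for illustration; the point is that the empirical frequencies match `p`, not `q`:

```python
import random

def sample_via_rejection(p, q, rng):
    """Draw one token: sample from the draft q, accept with
    probability min(1, p[t]/q[t]), otherwise resample from the
    adjusted distribution max(0, p - q), renormalized."""
    t = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[t] / q[t]):
        return t
    adjusted = [max(pj - qj, 0.0) for pj, qj in zip(p, q)]
    return rng.choices(range(len(p)), weights=adjusted)[0]

p = [0.6, 0.3, 0.1]   # target model's distribution (illustrative)
q = [0.2, 0.5, 0.3]   # draft model's distribution (illustrative)
rng = random.Random(1)
counts = [0, 0, 0]
for _ in range(20000):
    counts[sample_via_rejection(p, q, rng)] += 1
freqs = [c / 20000 for c in counts]
```

Despite every proposal coming from `q`, `freqs` converges to `p`: the rejection step is mathematically equivalent to sampling from the target model directly, which is why the final output quality is unchanged.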

---
*Source: https://ekstraktznaniy.ru/video/12542*