# How ChatGPT actually works

## Metadata

- **Channel:** AssemblyAI
- **YouTube:** https://www.youtube.com/watch?v=x_bw_IHjCWU
- **Date:** 23.01.2023
- **Duration:** 6:25
- **Views:** 34,435
- **Source:** https://ekstraktznaniy.ru/video/12739

## Description

Since its release, the public has been playing with ChatGPT and seeing what it can do, but how does ChatGPT actually work? While the details of its inner workings have not been published, we can piece together its functioning principles from recent research.

We are going to examine GPT-3's limitations and how they stem from its training process, then learn how Reinforcement Learning from Human Feedback (RLHF) works and understand how ChatGPT uses RLHF to overcome these issues.

If you'd like to dive deeper or learn about some of the limitations of this methodology, take a look at our comprehensive blog post:
https://www.assemblyai.com/blog/how-chatgpt-actually-works/

00:00 Overall look at how ChatGPT works
00:29 Misalignment issues
02:02 OpenAI's solution to misalignment
02:39 Step 1 - Creating the SFT model
03:27 Step 2 - Creating the Reward model
04:24 Step 3 - Finetuning the SFT model with the Reward model
04:56 Model evaluation
05:48 Further reading


## Transcript

### Overall look at how ChatGPT works [0:00]

ChatGPT was actually trained in quite a simple way: the researchers took a big language model that could generate natural-sounding sentences that did not exactly match what humans wanted, and then fine-tuned it with human feedback. But of course, the details are a little more complicated. ChatGPT is based on the GPT-3 model but has been further trained using human feedback to guide the learning process, with the specific goal of mitigating the model's misalignment issues. And what are these misalignment issues?

### Misalignment issues [0:29]

Well, language models like GPT-3 are really capable, but one of the problems is that they cannot align super well with human expectations. Some of the problems you can run into when generating text with GPT-3 are: lack of helpfulness, where the model does not follow the user's explicit instructions; hallucination, which means the model might make up nonexistent or wrong facts; lack of interpretability, meaning it is difficult for humans to understand how the model arrived at a particular decision or prediction; and finally, generating biased or toxic output. This last one happens due to the data the model was trained on: the language model may reproduce biased or toxic outputs even if it was not explicitly instructed to do so.

These problems occur because big language models are trained on tasks such as next-token prediction or masked language modeling. Although these tasks allow a model to learn the statistical structure of language, they can lead to problems, essentially because the model is not capable of distinguishing between an important error and an unimportant one. For example, in the sentence "The Roman Empire [MASK] with the reign of Augustus", it might predict "began" or "ended", as both words score high likelihoods of occurrence even though they have very different meanings. Overall, these models struggle to generalize to tasks or contexts that require a deeper understanding of language.
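The Augustus example above can be made concrete with a toy sketch. The probabilities below are hypothetical, purely for illustration: the point is that a likelihood-only training objective has no way to prefer the factually correct completion over a fluent but wrong one.

```python
# Toy sketch (hypothetical numbers): under a masked-language-modeling
# objective, the model only sees likelihoods, not factual importance.
context = "The Roman Empire [MASK] with the reign of Augustus"

# Hypothetical model likelihoods for the masked token
token_probs = {"began": 0.46, "ended": 0.42, "continued": 0.07, "pizza": 0.0001}

# Both "began" (correct) and "ended" (a serious factual error) score
# almost equally high — the training loss treats them as near-equivalent.
ranked = sorted(token_probs, key=token_probs.get, reverse=True)
top_two = ranked[:2]
```

Here `top_two` comes out as `["began", "ended"]`: two fluent completions with opposite meanings, which the objective cannot tell apart.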

### OpenAI's solution to misalignment [2:02]

To overcome these problems, the developers of ChatGPT used a technique called Reinforcement Learning from Human Feedback, and ChatGPT is actually the first use case of this technique in a model that was put into production. The technique of Reinforcement Learning from Human Feedback consists of three steps. But just a quick note before we get started: according to OpenAI, ChatGPT has been trained using the same methods as another model called InstructGPT, with only slight differences in the data collection setup. That is why, when preparing this content, we took the InstructGPT paper as a reference point.

### Step 1 - Creating the SFT model [2:39]

The first thing to do is collect demonstration data in order to train a supervised policy model, referred to here as the SFT model. For data collection, human labelers were asked to create the ideal output for a given prompt. The prompts were sourced partly from the labelers themselves and partly from requests sent to OpenAI through their API, for example for GPT-3. This process is slow and expensive, which is why the result is a relatively small, high-quality, curated dataset that is used to fine-tune a pre-trained language model. Using this dataset, the developers of ChatGPT fine-tuned a pre-trained model from the GPT-3.5 series. Even though it was trained on high-quality data, the outputs of the SFT model at this stage probably still suffered from misalignment.
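The SFT step is ordinary supervised fine-tuning: minimize the cross-entropy of the labeler-written demonstration under the model. A minimal sketch, where `toy_model` is a hypothetical stand-in returning a probability for each target token (a real model would compute this from its weights):

```python
import math

def toy_model(context, token):
    # Hypothetical lookup standing in for a real language model's
    # next-token probability p(token | context).
    table = {("Translate 'cat' to French:", "chat"): 0.7}
    return table.get((context, token), 0.1)

def sft_loss(prompt, ideal_output_tokens):
    """Cross-entropy of the human demonstration under the model."""
    loss = 0.0
    context = prompt
    for token in ideal_output_tokens:
        loss -= math.log(toy_model(context, token))
        context += " " + token  # teacher forcing: feed the gold token back in
    return loss

loss = sft_loss("Translate 'cat' to French:", ["chat"])
```

Gradient descent on this loss over the curated demonstrations is what turns the pre-trained GPT-3.5 model into the SFT model.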

### Step 2 - Creating the Reward model [3:27]

To overcome this problem, instead of asking the human labelers to come up with a bigger dataset, which would be a very slow and time-consuming process, the developers of ChatGPT came up with the second step: building the reward model. The goal of this model is to learn an objective function directly from the data. How it works is that for a list of prompts, the SFT model generates multiple outputs, anywhere between four and nine per prompt. Labelers then rank the outputs from best to worst. The result is a new labeled dataset where the rankings are the labels. Since ranking the outputs is easier than coming up with the outputs from scratch, this is a much more scalable process. This new data is used to train a reward model, or RM, which takes a few of the SFT model's outputs as input and scores them in order of preference. One important caveat to keep in mind is that this model strongly reflects the preferences of the labelers who worked on this project.
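The InstructGPT paper trains the reward model with a pairwise ranking loss: for every pair drawn from the labeler's ranking, push the reward of the preferred output above the reward of the rejected one. A minimal sketch of that loss (the reward values below are illustrative):

```python
import math
from itertools import combinations

def pairwise_loss(reward_preferred, reward_rejected):
    # -log(sigmoid(r_w - r_l)): small when the model already
    # assigns a higher reward to the preferred output
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def ranking_loss(rewards_best_to_worst):
    # With K ranked outputs per prompt, every ordered pair contributes
    # one term; average over the K-choose-2 pairs.
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_loss(rw, rl) for rw, rl in pairs) / len(pairs)

# Rewards already consistent with the human ranking -> low loss
low = ranking_loss([3.0, 1.0, -2.0])
# Rewards that contradict the human ranking -> high loss
high = ranking_loss([-2.0, 1.0, 3.0])
```

Minimizing this loss teaches the RM to emit a scalar score whose ordering matches the labelers' preferences, which is exactly what the PPO step needs.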

### Step 3 - Finetuning the SFT model with the Reward model [4:24]

The third and final step is fine-tuning the SFT model using Proximal Policy Optimization, or PPO. If you'd like to learn more about reinforcement learning, you can check out our video on reinforcement learning. In this step, the PPO model is initialized from the SFT model and the value function is initialized from the reward model. The environment is a bandit environment: it presents a random prompt and expects a response to the prompt. Given the prompt and the response, it produces a reward determined by the reward model, and the episode ends.
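One detail worth sketching: in the InstructGPT setup, the per-response reward fed to PPO is not the RM score alone but the RM score minus a KL penalty that keeps the tuned policy close to the SFT model. The coefficient `BETA` and the log-probabilities below are illustrative values, not the ones used in production:

```python
# Illustrative KL coefficient; the real value is a tuned hyperparameter.
BETA = 0.02

def rlhf_reward(rm_score, policy_logprobs, sft_logprobs):
    # Per-token KL estimate between the PPO policy and the frozen SFT
    # model: sum of log pi_policy(token) - log pi_sft(token).
    kl = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    # Penalize drift away from the SFT model to avoid reward hacking.
    return rm_score - BETA * kl

# The more the policy's token probabilities diverge from the SFT
# model's, the larger the amount subtracted from the RM score.
r = rlhf_reward(rm_score=1.5,
                policy_logprobs=[-0.1, -0.2],
                sft_logprobs=[-0.5, -0.9])
```

Without this penalty, the policy could drift into outputs that score well under the reward model but read as degenerate text.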

### Model evaluation [4:56]

Finally, let's talk about how this model was evaluated. Because the model was mainly trained on human labelers' input, the core part of the evaluation is also based on human input. To avoid overfitting to the judgment of the human labelers involved in the training phase, the test set consists of prompts that were not included in the training set at all. The model is evaluated on three high-level criteria. First is helpfulness, judging the model's ability to follow user instructions. Second is truthfulness, judging the model's tendency toward hallucinations, or making up facts; on this point, the model is evaluated using the TruthfulQA dataset. And lastly, harmlessness: the labelers evaluate whether the model's output is appropriate, denigrates a protected class, or contains derogatory content. For this point, the model is benchmarked on the RealToxicityPrompts and CrowS-Pairs datasets.

### Further reading [5:48]

So there is no denying that ChatGPT is a very impressive model. What do you think are the disadvantages, as well as the advantages, of ChatGPT? You can check out our comprehensive blog post to read more about the architecture of ChatGPT and the shortcomings of this methodology, and also to find some recommended readings on this topic. Thanks for watching, and I will see you in the next video.
