How ChatGPT actually works


AssemblyAI · 23.01.2023 · 34,435 views · 717 likes


Video description
Since its release, the public has been playing with ChatGPT and seeing what it can do, but how does ChatGPT actually work? While the details of its inner workings have not been published, we can piece together its functioning principles from recent research. We are going to examine GPT-3's limitations and how they stem from its training process, before learning how Reinforcement Learning from Human Feedback (RLHF) works and how ChatGPT uses RLHF to overcome these issues. If you'd like to dive deeper or learn about some of the limitations of this methodology, take a look at our comprehensive blog post: https://www.assemblyai.com/blog/how-chatgpt-actually-works/

00:00 Overall look at how ChatGPT works
00:29 Misalignment issues
02:02 OpenAI's solution to misalignment
02:39 Step 1 - Creating the SFT model
03:27 Step 2 - Creating the Reward model
04:24 Step 3 - Finetuning the SFT model with the Reward model
04:56 Model evaluation
05:48 Further reading

What is ChatGPT?

ChatGPT is the latest language model from OpenAI and represents a significant improvement over its predecessor GPT-3. Like many Large Language Models, ChatGPT is capable of generating text in a wide range of styles and for different purposes, but with remarkably greater precision, detail, and coherence. It represents the next generation in OpenAI's line of Large Language Models, and it is designed with a strong focus on interactive conversations.

How was ChatGPT trained?

The creators used a combination of both Supervised Learning and Reinforcement Learning to fine-tune ChatGPT, but it is the Reinforcement Learning component specifically that makes ChatGPT unique. The creators use a particular technique called Reinforcement Learning from Human Feedback (RLHF), which uses human feedback in the training loop to minimize harmful, untruthful, and/or biased outputs.

What is the difference between GPT-3 and ChatGPT?

Large Language Models, such as GPT-3, are trained on vast amounts of text data from the internet and are capable of generating human-like text, but they may not always produce output that is consistent with human expectations or desirable values. The alignment problem in Large Language Models typically manifests as:
- Lack of helpfulness: not following the user's explicit instructions.
- Hallucinations: the model making up nonexistent or wrong facts.
- Lack of interpretability: it is difficult for humans to understand how the model arrived at a particular decision or prediction.
- Generating biased or toxic output: a language model trained on biased/toxic data may reproduce that in its output, even if it was not explicitly instructed to do so.

What is Reinforcement Learning from Human Feedback?

The method consists of three distinct steps:
- Supervised fine-tuning step: a pre-trained language model is fine-tuned on a relatively small amount of demonstration data curated by labelers, to learn a supervised policy (the SFT model) that generates outputs from a selected list of prompts. This represents the baseline model.
- "Mimic human preferences" step: labelers are asked to vote on a relatively large number of the SFT model's outputs, creating a new dataset of comparison data. A new model, the reward model (RM), is trained on this dataset.
- Proximal Policy Optimization (PPO) step: the reward model is used to further fine-tune and improve the SFT model. The outcome of this step is the so-called policy model.

▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬▬
🖥️ Website: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_mis_34
🐦 Twitter: https://twitter.com/AssemblyAI
🦾 Discord: https://discord.gg/Cd8MyVJAXd
▶️ Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
🔥 We're hiring! Check our open roles: https://www.assemblyai.com/careers
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#MachineLearning #DeepLearning
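The three RLHF steps above can be sketched as a simple pipeline. The function names, artifact dictionaries, and dataset sizes below are illustrative placeholders, not OpenAI's actual code or API:

```python
# Illustrative RLHF pipeline skeleton: each stage is a stub that records
# which artifacts feed into which step.
def supervised_fine_tune(pretrained_model, demonstrations):
    # Step 1: learn the SFT baseline policy from labeler demonstrations.
    return {"name": "SFT", "base": pretrained_model, "n_examples": len(demonstrations)}

def train_reward_model(sft_model, comparisons):
    # Step 2: learn a reward model from labeler rankings of SFT outputs.
    return {"name": "RM", "base": sft_model["name"], "n_examples": len(comparisons)}

def ppo_fine_tune(sft_model, reward_model, prompts):
    # Step 3: optimize the SFT policy against the reward model with PPO.
    return {"name": "policy", "init": sft_model["name"], "reward": reward_model["name"]}

sft = supervised_fine_tune("GPT-3.5", demonstrations=["..."] * 10)
rm = train_reward_model(sft, comparisons=["..."] * 100)
policy = ppo_fine_tune(sft, rm, prompts=["..."] * 50)
print(policy["init"], policy["reward"])  # → SFT RM
```

Note how each later stage consumes the artifacts of the earlier ones: the reward model is trained on SFT outputs, and the final policy is initialized from the SFT model and scored by the reward model.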

Table of contents (8 segments)

Overall look at how ChatGPT works

ChatGPT was actually trained in quite a simple way: the researchers took a big language model that could generate natural-sounding sentences but did not exactly match what humans wanted, and then fine-tuned it with human feedback. But of course, the details are a little bit more complicated. ChatGPT is based on the GPT-3 model but has been further trained using human feedback to guide the learning process, with the specific goal of mitigating the model's misalignment issues. And what are these

Misalignment issues

misalignment issues? Well, language models like GPT-3 are really capable, but one of the problems is that they cannot align super well with human expectations. Some of the problems you can run into when generating text with GPT-3 are: lack of helpfulness, where it might not follow the user's explicit instructions; or it might hallucinate, meaning the model might make up nonexistent or wrong facts. You might also run into lack of interpretability: it is difficult for humans to understand how the model arrived at a particular decision or prediction. And finally, there could be the problem of generating biased or toxic output, and this happens due to the data that the model was trained on: the language model may reproduce biased or toxic outputs even if it was not explicitly instructed to do so. These problems occur because big language models are trained on tasks such as next-token prediction or masked language modeling. Although these tasks allow a model to learn the statistical structure of language, they can lead to problems, essentially because the model is not capable of distinguishing between an important error and an unimportant error. For example, in the sentence "The Roman Empire [MASK] with the reign of Augustus", it might predict "began" or "ended", as both words score high likelihoods of occurrence even though they have very different meanings. Overall, these models struggle to generalize to tasks or contexts that require a deeper understanding of language. To overcome
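The "began"/"ended" example can be made concrete with a toy softmax over hypothetical logits. The scores below are invented for illustration, not taken from any real model:

```python
import math

# Hypothetical raw scores (logits) a masked language model might assign to
# fillers for: "The Roman Empire [MASK] with the reign of Augustus"
logits = {"began": 4.1, "ended": 3.9, "banana": -2.0, "sings": -3.5}

def softmax(scores):
    # Convert raw scores into a probability distribution.
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

probs = softmax(logits)
# "began" and "ended" both receive high probability, even though they imply
# opposite meanings -- the training objective treats picking the factually
# wrong one as no worse than any other token mismatch.
print({w: round(p, 3) for w, p in probs.items()})
```

Both plausible-sounding words end up with close to half the probability mass each, which is exactly the "important vs. unimportant error" blindness described above.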

OpenAI's solution to misalignment

these problems, the developers of ChatGPT use a technique called Reinforcement Learning from Human Feedback, and ChatGPT is actually the first use case of this technique in a model that was put into production. The technique of Reinforcement Learning from Human Feedback consists of three steps. But just a quick note before we get started: according to OpenAI, ChatGPT has been trained using the exact same methods as another model called InstructGPT, with only slight differences in the data collection setup. That's why, when preparing this content, we took the InstructGPT paper as a reference point.

Step 1 - Creating the SFT model

The first thing to do is collect demonstration data in order to train a supervised policy model, referred to here as the SFT model. For data collection, human labelers were asked to create the ideal output given a prompt. The prompts are sourced partly from the labelers themselves and partly from requests sent to OpenAI through their API, for example for GPT-3. This process is slow and expensive; that's why the result is a relatively small, high-quality, curated dataset that is used to fine-tune a pre-trained language model. Using this dataset, the developers of ChatGPT fine-tuned a pre-trained model from the GPT-3.5 series. Even though it was trained on high-quality data, the outputs of the SFT model at this stage probably also suffered from misalignment. To overcome
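A minimal sketch of what the demonstration data might look like, assuming a standard prompt-plus-completion format. The prompts, completions, and end-of-text token handling are illustrative assumptions, not the real dataset:

```python
# Sketch of the demonstration dataset used for supervised fine-tuning.
# Each record pairs a prompt with the labeler's ideal output.
demonstrations = [
    {"prompt": "Explain gravity to a child.",
     "completion": "Gravity is the force that pulls things toward the ground."},
    {"prompt": "Translate 'hello' to French.",
     "completion": "Bonjour."},
]

def to_training_example(record, eos="<|endoftext|>"):
    # The model is trained to continue the prompt with the demonstration,
    # using the ordinary next-token (cross-entropy) objective.
    return record["prompt"] + "\n" + record["completion"] + eos

batch = [to_training_example(r) for r in demonstrations]
print(len(batch))  # → 2
```

The key point is that the supervised objective here is the same next-token prediction as in pre-training; only the data changes, from internet text to curated (prompt, ideal output) pairs.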

Step 2 - Creating the Reward model

this problem, instead of asking the human labelers to come up with a bigger dataset, because that would be a very slow and time-consuming process, the developers of ChatGPT came up with the second step, which is creating the reward model. The goal of this model is to learn an objective function directly from the data. How it works is: for a list of prompts, the SFT model generates multiple outputs, anywhere between four and nine. Labelers then rank the outputs from best to worst. The result is a new labeled dataset where the rankings are the labels. Since ranking outputs is easier than coming up with outputs from scratch, this is a much more scalable process. This new data is used to train a reward model, or RM, in which the model takes as input a few of the SFT model's outputs and ranks them in order of preference. One important caveat to keep in mind is that this model strongly reflects the preferences of the labelers that worked on this project. And
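The ranking step can be sketched as follows. Turning one ranking into pairwise comparisons and scoring them with the preference loss -log sigmoid(r_preferred - r_rejected) is the common formulation from the InstructGPT line of work; the outputs and reward values below are invented for illustration:

```python
import math
from itertools import combinations

# Labeler ranking of K sampled SFT outputs for one prompt, best first.
ranked_outputs = ["answer A", "answer B", "answer C", "answer D"]

def ranking_to_pairs(ranking):
    # A ranking of K outputs yields K*(K-1)/2 (preferred, rejected) pairs,
    # which is why ranking scales so well as a labeling strategy.
    return [(ranking[i], ranking[j])
            for i, j in combinations(range(len(ranking)), 2)]

pairs = ranking_to_pairs(ranked_outputs)  # 4 outputs -> 6 pairs

def pairwise_loss(reward_preferred, reward_rejected):
    # -log sigmoid(r_w - r_l): small when the reward model scores the
    # preferred output higher than the rejected one.
    diff = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(len(pairs), round(pairwise_loss(2.0, 0.5), 3))
```

Minimizing this loss pushes the reward model to assign higher scalar scores to outputs that labelers preferred.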

Step 3 - Finetuning the SFT model with the Reward model

The third and final step is fine-tuning the SFT model using Proximal Policy Optimization, or PPO. If you'd like to learn more about reinforcement learning, you can check out our video on reinforcement learning. In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the reward model. The environment is a bandit environment: it presents a random prompt and expects a response to that prompt. Given the prompt and the response, it produces a reward determined by the reward model, and the episode ends. And finally, let's
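A sketch of how the per-episode reward might be computed in the PPO step, assuming the common formulation in which the reward model's score is penalized by a KL term that keeps the policy close to the SFT model. The beta value and log-probabilities below are illustrative, not OpenAI's actual numbers:

```python
# Per-episode reward for the PPO step: the reward model's scalar score
# minus a KL penalty discouraging drift from the SFT model.
def ppo_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    # Per-token KL estimate: log pi_policy(token) - log pi_sft(token),
    # summed over the tokens of the generated response.
    kl = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl

# Toy episode: the policy has drifted slightly from the SFT model,
# so a small penalty is subtracted from the reward model's score.
reward = ppo_reward(rm_score=1.3,
                    policy_logprobs=[-1.0, -0.5, -2.0],
                    sft_logprobs=[-1.2, -0.9, -2.1])
print(round(reward, 3))  # → 1.286
```

The KL penalty is the design choice that stops the policy from collapsing into outputs that merely exploit the reward model while becoming incoherent language.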

Model evaluation

talk about how this model was evaluated. Because the model was mainly trained on human labelers' input, the core part of the evaluation is also based on human input. To avoid overfitting to the judgment of the human labelers that were involved in the training phase, the test set consists of prompts that were not included at all in the training set, and the model is evaluated on three high-level criteria. First is helpfulness, judging the model's ability to follow user instructions. Second is truthfulness, judging the model's tendency for hallucinations, or making up facts; on this point the model is evaluated using the TruthfulQA dataset. And lastly, harmlessness: the labelers evaluate whether the model's output is appropriate, denigrates a protected class, or contains derogatory content. For this point the model is benchmarked on the RealToxicityPrompts and CrowS-Pairs
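A toy sketch of aggregating labeler judgments on held-out prompts into the three criteria above. The judgment records are invented for illustration:

```python
# Each record is one labeler judgment of one held-out prompt's output
# against the three high-level evaluation criteria.
judgments = [
    {"helpful": True,  "truthful": True,  "harmless": True},
    {"helpful": True,  "truthful": False, "harmless": True},
    {"helpful": False, "truthful": True,  "harmless": True},
    {"helpful": True,  "truthful": True,  "harmless": False},
]

def criterion_rate(records, criterion):
    # Fraction of held-out prompts judged acceptable on this criterion.
    return sum(r[criterion] for r in records) / len(records)

scores = {c: criterion_rate(judgments, c)
          for c in ("helpful", "truthful", "harmless")}
print(scores)  # → {'helpful': 0.75, 'truthful': 0.75, 'harmless': 0.75}
```

Keeping the evaluation prompts disjoint from the training prompts is what makes these rates meaningful: they measure generalization, not memorization of labeler preferences seen during training.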

Further reading

datasets. So, there is no denying that ChatGPT is a very impressive model. What do you think are the disadvantages as well as the advantages of ChatGPT? You can check out our comprehensive blog post to read more about the architecture of ChatGPT, the shortcomings of this methodology, and also to find some recommended readings on this topic. Thanks for watching, and I will see you in the next video.
