Large Scale Reward Modeling | Jonathan Ward | OpenAI Scholars Demo Day 2021

OpenAI · 10.05.2021

Video description
Learn more: https://openai.com/blog/openai-scholars-2021-final-projects#jonathan

Table of contents (16 segments)

Introduction

Hey everyone, I'm Jonathan Ward, and I've been mentored by John Schulman over the past six months. In that time I've explored the possibility of large-scale reward modeling. What this means in practice is: how do we learn what people want, and then build models that are better able to do that?

What should models do

So this question really starts with: what should models do? There are a couple of domains we can think about. One is the formal domain, in which specifying the task is really clear and very simple. This would include game playing: board games, video games. A lot of the recent results in machine learning, around beating Atari or various other video games, or chess, or Go, have really centered on these domains where it's easy to provide clear feedback to the model about what to do next. But a lot of life is actually much more informal, where specifying what correct behavior looks like is much harder to do, and this is the area I'm going to be focusing on today.

Formalization

There are really a couple of ways to proceed, and I'll contrast these two in particular. One is the idea of formalizing what's informal: trying to write a function that somehow captures the nuances of the problem at hand. This would include things like ROUGE or BLEU, if you're familiar with those terms in machine learning; these are functions that essentially try to measure how good a summary is, or how good a translation is. But then there's this other approach, which is what I'll focus on, which essentially aims to understand what's good by simply asking people to compare two things or to rate something. This is the setting of learning human preferences.
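To make the contrast concrete, here is a minimal sketch in Python of the two framings: a formal metric needs a reference text to score against, while preference learning only needs records of which of two candidates a person preferred. The names and data layout here are illustrative, not taken from the talk.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One human preference judgment: which of two responses to a prompt is better."""
    prompt: str        # the task input, e.g. a writing prompt
    response_a: str    # first candidate response
    response_b: str    # second candidate response
    a_preferred: bool  # True if the human preferred response_a

def unigram_recall(candidate: str, reference: str) -> float:
    """Crude ROUGE-1-style recall: fraction of reference words that appear in the candidate."""
    cand_words = set(candidate.lower().split())
    ref_words = reference.lower().split()
    return sum(w in cand_words for w in ref_words) / max(len(ref_words), 1)
```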

Feedback

So the tricky part here is actually getting these preferences, getting feedback. There are a couple of ways of doing this, and a lot of excellent work has been done at OpenAI, and at DeepMind as well, on this setting of understanding how to use feedback to train systems that incorporate that feedback and improve their performance. This prior work has really focused on interactive feedback: feedback where you hire contractors, and the researchers work with those contractors to make sure there's a common understanding. Past work has demonstrated that you can build accurate models of human preferences using this type of feedback. This is, however, expensive. One potential way forward is to use feedback that's directly available on the internet. This is potentially much less expensive; you can potentially gather more of this data, and you can potentially gather it across various tasks. So that's what I'm really going to focus on. The operative question is essentially: can we train an accurate model of human preferences from feedback that's gathered on the internet?

Interactive Feedback

Just to reflect on that a little bit more: with interactive feedback you get some benefits. You can make sure the contractors, or whoever is providing the feedback, have a similar sense of preferences to you, or the researchers, or whatever your gold standard is for the true preferences; you can make sure there's a close match there. With internet feedback you kind of get what's already out there: if there are a lot of ratings of one thing being better than another, better stories, better answers to questions, then that model of what exists and what's preferred is what you're capable of learning from. So then, with this in mind,

Task-Oriented Feedback

I really wanted to focus on more structured, task-oriented feedback. A lot of feedback on the internet is generic, scattered across various tasks: a like on Twitter or YouTube isn't really responding to performance on a specific task, it just says that comment or that video was good. But in these task-oriented domains you can actually get a clear answer about whether a certain explanation or a certain answer to a question was good. There's a very clear sense of input and output.

Reddit

In particular, I'll be focusing on Reddit. Reddit is the seventh most popular site in the US, and it's organized into subreddits that have particular tasks and particular structures in the way that they give feedback; they might value certain things as a community. I'll focus on the community of r/WritingPrompts in particular. This is a community of short story writers, and it's structured around writing prompts.

Writing Prompts

If you're trying to respond to one of these things, a writing prompt looks like this: "A small dragon must defend his hoard: a single coin." People then provide various responses to that writing prompt, and each response gets some number of upvotes and downvotes, which together produce a resulting score. These scores actually reflect some measure of the aggregate preferences of the people on r/WritingPrompts, so we can try to learn a model of those preferences using these scores.
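As a rough illustration of how such scores could be turned into training data, here is a sketch that converts the responses to a single prompt into pairwise comparisons, labelling the higher-scoring response as preferred. The field names and the pairing scheme are assumptions for illustration, not the talk's actual preprocessing.

```python
from itertools import combinations

def comparisons_from_thread(prompt: str, responses: list[dict]) -> list[dict]:
    """responses: [{"text": str, "score": int}, ...] for one writing prompt."""
    pairs = []
    for a, b in combinations(responses, 2):
        if a["score"] == b["score"]:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append({"prompt": prompt,
                      "chosen": chosen["text"],
                      "rejected": rejected["text"]})
    return pairs
```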

Generative Model

In particular, there are a few models that I want to train. The first model is essentially the generative model. This is the model that takes writing prompts as input and produces a response, so it's somewhat analogous to someone who's browsing the subreddit and writing.
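As a stand-in for the generative model's role, here is a sketch that samples a story continuation from the public GPT-2 checkpoint via Hugging Face transformers. The project's fine-tuned weights aren't used here, and the prompt formatting is hypothetical.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical prompt formatting; r/WritingPrompts prompts conventionally start with "[WP]".
prompt = "[WP] A small dragon must defend his hoard: a single coin.\n\n"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,          # sample rather than greedy decode, for varied stories
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```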

Evaluative Model

The next model is the evaluative model. This would be analogous to someone who's lurking and provides an upvote or downvote on these stories. The evaluative model gets the prompt, it gets two responses to that prompt, and it simply outputs which of the two responses is better.
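A common way to train such an evaluative (reward) model is to score each prompt–response pair with a scalar and use the pairwise logistic loss, so the preferred response is pushed to score higher. This is a minimal sketch of that objective; the talk doesn't spell out the exact loss, so treat this as the standard formulation rather than the project's exact code.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """reward_model maps a token-id sequence to a scalar score per sequence."""
    r_chosen = reward_model(torch.cat([prompt_ids, chosen_ids], dim=-1))
    r_rejected = reward_model(torch.cat([prompt_ids, rejected_ids], dim=-1))
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen response outscores the rejected one
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```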

Agent Model

The last model in this system that I'm considering is the gameplay model, the agent model. The agent model starts off from the generative model, so it's just something that learns to produce stories similar to the stories it has seen, but it's further trained using the feedback it gets from the evaluative model. The evaluator essentially provides feedback to one of two agents that are playing against each other, and it provides that indication in the form of saying which one of the stories is better.
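At a high level, the agent training loop looks something like the sketch below: the policy starts from the generative model, samples stories, and is nudged toward the stories the evaluative model prefers. The interfaces (`policy.sample`, `reward_model.compare`, `ppo_update`) are hypothetical placeholders; the talk doesn't detail the exact algorithm, though PPO is typical for this line of work.

```python
def train_agent(policy, reward_model, prompts, ppo_update, n_steps=1000):
    """Abstract sketch of reward-model-guided fine-tuning of the generative policy."""
    for step in range(n_steps):
        prompt = prompts[step % len(prompts)]
        story_a = policy.sample(prompt)   # two "players" each produce a story
        story_b = policy.sample(prompt)
        # the evaluative model says which story is better; use that as the learning signal
        a_is_better = reward_model.compare(prompt, story_a, story_b)
        ppo_update(policy, prompt, story_a, story_b, a_is_better)
    return policy
```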

Model Sequence

So there's a sequence of models that I train here. I start with pre-trained models; these are widely available and provide a great starting point for experiments. I train the generative model, I train the evaluative model, and I combine the generative model and the evaluative model to produce the agent.

Training

Ultimately, I can take those outputs and make them available to the public, to you all, for example at rewardmodeling.com, a website I built for this project where I can actually gather some of your feedback on whether the output of this model matches your preferences.

Results

The most important result is really: how well does this reward model generalize, how well does it actually capture our preferences? To assess this, we essentially train a large model on some number of comparisons and then test it on a set of comparisons that it hasn't seen before. In particular, for this project I wanted to make sure the model wasn't learning anything spurious: that it wasn't learning preferences based on how long the responses were, or based on how quickly they were made. So I removed several confounders and filtered down to a hard test set of roughly similar responses that were made at roughly the same time. With this, I got a final model accuracy of 74.2%.

To place that number in some context, it's worth remembering there's some inherent noise in these preferences: they come from a lot of people and were gathered across ten years of Reddit data, so they might vary over time. To understand this more closely, let's look at this graph, which shows how accuracy changes as we vary the model size and the number of examples it has seen. On the x-axis is the number of samples in the training set, the number of examples the model sees before it's tested, and on the y-axis (the left) is the accuracy, its performance on this test set. I'll draw your attention to the performance of GPT-2 XL, which is the largest model that I trained. You can see that it actually learns the fastest of all the models, but then it essentially saturates at around 74.2%. There are a couple of things we can draw from here. One, larger models simply learn faster: they can extract more meaning from the data that's given to them. Two, there are continued gains from increasing the number of samples the model sees, but this is most pronounced for the smaller models.

With this in mind, it's really interesting to think about what happens if we combine datasets across different subreddits, what happens when we start to explore things like transfer, and this is really where I want to take this project next. The basic idea is: what if we trained this reward model on a lot of different subreddits, tasks basically, and tested its performance on a task it hadn't seen before? In many ways this more closely captures what we want out of a model in reality. For a lot of tasks we won't have the vast amount of feedback that comes from Reddit, and for a lot of tasks we won't be able to gather expensive human feedback. What we really want is a reward model that's been trained on pre-existing signals and then performs well on some evaluation that it hasn't been prepared for.
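The headline number above is held-out comparison accuracy: how often the reward model ranks the human-preferred response higher. A simple sketch of that evaluation, with a hypothetical `score` function returning the model's scalar reward, might look like this.

```python
def heldout_accuracy(score, test_comparisons) -> float:
    """Fraction of held-out comparisons where the model scores the human-preferred response higher."""
    correct = 0
    for c in test_comparisons:  # each c: {"prompt": ..., "chosen": ..., "rejected": ...}
        if score(c["prompt"], c["chosen"]) > score(c["prompt"], c["rejected"]):
            correct += 1
    return correct / len(test_comparisons)
```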

Long-Term Direction

Along these lines, you can think of a long-term direction for this field: gathering feedback from the internet is almost analogous to pre-training in the common language modeling setup, and the interactive feedback that people gather is analogous to fine-tuning, in the sense that we can more carefully construct that set of feedback. We can hire contractors with particular expertise, or we can make sure we have a broad reflection of the feedback from all the different groups of people we care about. That way we can make sure the true preferences are more accurately captured, while also getting some performance benefit from using this internet feedback.

Cautionary Note

I'll end on a bit of a cautionary note: what preferences have we learned? Reddit isn't really representative of the globe; it's skewed in many ways. So if we actually wanted a model that represents a very general sense of what we mean for something to be a good story or not, we're going to have to balance out this dataset, and that will be important going forward. We're going to want people with more expertise in writing, and we're going to want people with a lot of different influences. With that, I'd like to thank my mentor, the other Scholars, especially Sam and Danielle who donated lots of compute, and the organizers for making this all possible. And then I will answer questions.

All right, so: how would you get rid of any concern of bias on Reddit? I'll answer this one live. I think the issue is that you simply can't. I think the way to approach that is to actually balance it with other datasets. One thing that I did do is filter out explicit text, but there are a lot of issues there, so I think the future of this is probably balancing the internet feedback with a more curated dataset of feedback.

Do I think the 75% accuracy ceiling is due to noise in the labels or weakness of the models? I'd say it's potentially a bit of both. I'd be very interested in seeing whether a larger model with more capacity can move past that, but it does look like the various model sizes were converging around 75% accuracy, so I actually think it's probably mostly noise in the labels. I did notice that when I added the timing data, the delay between the submission and the response, to the language model input, the model was able to further increase its accuracy; it was able to take into account both the text itself and the speed of the response. And with that, I think I'm at time. I'd love to answer any other questions in the future, but that's it for me.
