Learning to summarize from human feedback (Paper Explained)
45:30


Yannic Kilcher · 07.09.2020 · 20,905 views · 690 likes


Video description
#summarization #gpt3 #openai

Text Summarization is a hard task, both in training and evaluation. Training is usually done maximizing the log-likelihood of a human-generated reference summary, while evaluation is performed using overlap-based metrics like ROUGE. Both significantly undervalue the breadth and intricacies of language and the nature of the information contained in text summaries. This paper by OpenAI includes direct human feedback both in evaluation and - via reward model proxies - in training. The final model even outperforms single humans when judged by other humans and is an interesting application of using reinforcement learning together with humans in the loop.

OUTLINE:
0:00 - Intro & Overview
5:35 - Summarization as a Task
7:30 - Problems with the ROUGE Metric
10:10 - Training Supervised Models
12:30 - Main Results
16:40 - Including Human Feedback with Reward Models & RL
26:05 - The Unknown Effect of Better Data
28:30 - KL Constraint & Connection to Adversarial Examples
37:15 - More Results
39:30 - Understanding the Reward Model
41:50 - Limitations & Broader Impact

Paper: https://arxiv.org/abs/2009.01325
Blog: https://openai.com/blog/learning-to-summarize-with-human-feedback/
Code: https://github.com/openai/summarize-from-feedback
Samples: https://openaipublic.blob.core.windows.net/summarize-from-feedback/website/index.html#/

My Video on GPT-3: https://youtu.be/SY5PvZrJhLE
My Video on GPT-2: https://youtu.be/u1_qMdb0kYU

Abstract: As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about --- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.

Authors: Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (11 segments)

Intro & Overview

"Hi Reddit, my boyfriend and I have been dating for a year and it has been great, except for one thing: Dota. The other day, on a Saturday, I was over and he was playing a game. I thought it would just be one, but instead he proceeded to play for three hours as I just sat there. What can I do?"

So this, as you can see, is a post from a subreddit called r/relationships, of someone seeking relationship advice. Now, I would claim that this is clearly fake, because no one plays Dota for just three hours, but let's assume that this really happened. The task is to summarize this post in as few tokens as you can, while keeping as much of the information that is in the post itself. This task is called summarization, and humans can do it quite well. Here you see a human-written reference baseline: "My boyfriend games whenever he can. How can I get him to stop gaming so much and focus more on school and our relationship?" That's a pretty good summary of what goes on in this post.

The easiest baselines for this task in machine learning are extractive baselines. In extractive summarization, you try to find sub-spans — say, this span followed by this span, and so on — that together represent the article; you strictly select sub-spans, or even entire phrases, from the text you're looking at. A lot of these baselines are extractive, and they already perform fairly okay. For example, this one: "Help, my boyfriend is neglecting his studies and our relationship because of a video game." I think that's just extracting from the title — that's the title policy. There are other models, for example the Lead-2 baseline here: "Hi Reddit, my boyfriend and I have been dating for a year and it has been great." That maybe doesn't accurately represent the post. So you can already see that this is quite hard, because not only does a model have to understand what information is in the text and what the important things are, it also clearly needs to understand something about the intent of the post. If you want to compress, you have to compress the meaning — and because we are humans, we understand that this person is distressed and seeking advice ("what should I do?"), and that the source of the frustration is the fact that the boyfriend plays a lot of this video game. It's not really important how much he played, or even that they've been dating for a year; the problem being communicated is the playing of video games.

The researchers have come up with a bunch of models, and their best model, the one we're going to look at here, is called the human feedback model, with 6.7 billion parameters. It's a GPT-style model, and we'll get to all of this in one second; I just want to show you the end result, which outputs the following: "My boyfriend is neglecting his studies and our relationship because of his excessive gaming of a video game. What can I do to get him to stop?" There are a couple of nuances here: the "what can I do to get him to stop" is not explicitly said in the text. The post says things like it seems to interfere with our relationship, he's doing his PhD and is obviously swamped, it goes on the back burner, it makes me rethink our relationship, and so on. These things aren't explicitly said, yet the model somehow understands that that's what this person expresses, and if you want to compress this information, this is a very good summary to output.

So we'll see how they built this model, what it has to do with human feedback, how it works in general, and also where it fails. This is a pretty big paper — it's one of those papers where the appendix needs a table of contents — and there are lots of references. It's a paper by OpenAI; recently OpenAI has made big advancements in language research with GPT-3, and this is from the same line of research. The paper is called "Learning to Summarize from Human Feedback", by Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei and Paul Christiano, as I said, of OpenAI. They tackle this task of summarization of these kinds of posts, or news articles — you can apply this pretty much anywhere — and they incorporate human feedback into it. We'll see why in a moment.
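The extractive baselines mentioned above are simple enough to sketch in a few lines. This is a minimal, hypothetical illustration of a "Lead-N"-style baseline (copy the first N sentences verbatim), not the paper's actual implementation:

```python
import re

def lead_n(post, n=2):
    # "Lead" extractive baseline: output the first n sentences of the
    # post verbatim. Sentence splitting here is a naive regex on
    # terminal punctuation, just for illustration.
    sentences = re.split(r'(?<=[.!?])\s+', post.strip())
    return " ".join(sentences[:n])

post = ("Hi Reddit, my boyfriend and I have been dating for a year "
        "and it has been great. Except for one thing: Dota. "
        "The other day he played for three hours straight.")
print(lead_n(post, 2))
```

You can see why such a baseline is "fairly okay" but limited: it copies fluent text from the source, but it has no notion of which sentences carry the actual problem the poster is describing.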

Summarization as a Task

Summarization isn't a straightforward task. In its basic form, you have some piece of text that contains some information, and from it you want to generate a small piece of text. That small piece should, first, be very short, and second, it should contain the important information of what is in the original article — maybe not all of it, but the important parts. There are other desiderata, like coherence, but I think that's sort of implicit in this information objective: if someone reads the short piece of text, they should get most of the important information that was in the big text.

Humans are quite okay at this, but it's not like we can really formulate exactly what we want. It's not like we can give a classification label and tell the machine: look, this class is correct and these other classes are wrong. What people have been doing is building datasets where, for one particular document, you give it to, let's say, three different humans, and the three humans produce three different summaries — because different humans do it differently. Then you let your machine learning model produce some summary, and your evaluation metric is a metric that takes the model's piece of text and compares it

Problems with the ROUGE Metric

to the human-written summaries. One of these methods is called ROUGE. ROUGE is a metric that looks at n-gram overlap — the Wikipedia page is pulled up here, and you can see it consists of a bunch of sub-metrics (and there is a way to mix them) — but in essence it looks at overlaps of n-grams: you can look at unigrams or bigrams, you can look at the longest common subsequence, and so on. Basically, you compare the words in the model's text to the words in the human summaries. Given the rich nature of language, that's not really a good approach, but it's the best one we have — we don't have a better metric to tell the machine what's right or wrong.

And it actually goes further: ROUGE as an evaluation metric is fairly bad, as we will see. They have a graph somewhere, and I might just draw it: on one axis is how good the summary really is, as rated by humans — this paper places a lot of emphasis on going to actual humans and asking them how good a summary is. At the beginning, ROUGE increases as quality increases: for really bad models, the ROUGE metric makes sense, because a very crappy model versus one that outputs the same kind of text as the humans do — the latter is going to fare better. But at some point it levels off; at some level of complexity and coherence, the ROUGE metric is just not good enough anymore to differentiate. It's good at differentiating bad from good summaries, but not good from excellent ones.

Okay, so that's one thing — that's evaluation. But ROUGE, this overlap of n-grams, as you can imagine, is not differentiable. So the second problem is: how do we even train this thing? This was the eval side.

Training Supervised Models

ROUGE was for evaluation, but in training you do something that, let's say, makes even less sense from a principled point of view. What you want to do is simply make the machine output the reference texts: you say, these texts are correct, now please output those. It's kind of like a variational autoencoder where you want it to output a very specific picture, but you've given it that picture as input. You can imagine it like this: you say, this is the input, and this is the output I want you to produce — and now I can actually backpropagate the production of this exact text from this input.

Their model here is a GPT-3-style model. It's not as big as GPT-3: their biggest model, I think, is 6.7 billion parameters, whereas GPT-3 has what, 175 billion parameters or something like that. The model works as follows: you take the text, you unroll it so that it's just one string, and then you let the model produce the next word piece, and then the next, and so on, until you've output the summary. That's something you can backpropagate through with simple language model learning. I'm maybe ragging on this a bit too much — of course many things are trained like this in language learning; translation is learned like this, and these simple generative language models are learned like this, so it's not that terrible. But you can see that evaluating with ROUGE while training like this — both are not particularly suited to what we actually want, which is that humans would rate these summaries well. But we can't do that directly.
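The supervised setup just described — unroll post plus summary into one string, train with next-token log-likelihood on the summary part — can be sketched as below. The tokenization and `toy_log_prob` are hypothetical stand-ins; a real implementation uses a GPT-style model's per-token log-probabilities.

```python
import math

def format_example(post, summary):
    # One flat string; "TL;DR:" marks where the summary begins, mirroring
    # the prompt used at inference time.
    prompt = post + "\nTL;DR:"
    return prompt, prompt + " " + summary

def toy_log_prob(context, token, vocab_size=50000):
    # Placeholder: a real language model returns log p(token | context).
    # Here every token is uniformly likely, purely for illustration.
    return -math.log(vocab_size)

def summary_nll(post, summary):
    # Average negative log-likelihood, computed only over summary tokens —
    # the loss the supervised baseline minimizes.
    prompt, full = format_example(post, summary)
    tokens = full.split()
    n_prompt = len(prompt.split())
    nll = 0.0
    for i in range(n_prompt, len(tokens)):
        nll -= toy_log_prob(tokens[:i], tokens[i])
    return nll / (len(tokens) - n_prompt)
```

The key design point is the loss mask: the post is conditioning context only; gradients come from predicting the reference summary's tokens.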

Main Results

That's the problem this paper solves. Here they show their final results already. Down here you have model size — we won't worry about that right now; there's also a question of scaling, and so on. If they use a language model that was just pre-trained on language — no explicit training for summarization — we've already seen in the GPT-2 and GPT-3 papers that if you take a piece of text and append the string "TL;DR" (too long, didn't read — in forum posts people very often put this and then a summary), this prompts the model to produce a summary. If this seems mysterious to you, I've made videos on GPT-2 and GPT-3 explaining how this works. So a model that has only been trained on language modeling will actually be able to do summarization to a certain degree, but as you can see, it's still below the quality of the reference summaries.

This y-axis is really what humans think of these summaries. The way they evaluate is: they present a human with two different summaries and ask which one they prefer. One of the two is always a human reference summary — and if you give them two human summaries, it's of course random which one they prefer, hence the 0.5 point. If you give them one summary from the pre-trained model and one human summary, you can see that the pre-trained summary loses most of the time — like 70 to 80 percent of the time — against the human reference.

The second step is to take this model and produce what they call a supervised baseline. That's what we discussed just now when we asked how we even train this: you take a dataset — some reviewers call datasets "databases", which freaks me out; there must be parts of the world where datasets are called databases — in which you have samples of text and corresponding summary. Call these your x and y, and simply train a model to take in the x and predict the y. Instead of a class label, the y is a string, a piece of output text, and you can do this with a generative language model. That's the supervised baseline. If they do that, they get closer, as you can see: there is quite a bit of distance between the pre-trained model and the supervised baseline, which starts from the pre-trained model but actually trains it to do summarization. But you're still not at the level of the reference summaries.

And then they have this mysterious human feedback model that all of a sudden gets better than the reference summaries — it actually outperforms them — and we're going to look at how this comes about. First, their stated contributions: they show that training with human feedback significantly outperforms very strong baselines on English summarization; they show that human feedback models generalize much better to new domains than supervised models; and they conduct extensive empirical analyses of their policy and reward model. If you see the words "policy" and "reward model", that already means that reinforcement learning is going to play a role here.

Including Human Feedback with Reward Models & RL

Here's how it works. This all starts from the supervised model: you have the pre-trained model, and from it you've produced a supervised model that is explicitly trained to do summarization, but just on a dataset. Now you want to incorporate human feedback, and that goes as follows. First, you collect the human feedback. You could do various things — you could let the humans score summaries — but what they do is always present the human with two different summaries, plus the corresponding piece of text (that's important), and ask which one they prefer: which summary is better, in just a human sense. The labelers worked closely with the researchers — that's an advantage if you're OpenAI and have lots of funding; it appears they paid these humans quite well and worked with them closely in order to ensure the high quality of their feedback.

Now, you could imagine training a model on that signal directly: the model produces a summary, a human decides whether it's better or worse, and the model somehow optimizes on that. That's not exactly what they do, because it would require too many humans — these language models take a lot of data, and even with OpenAI's budget it's not feasible to go and ask a human at every single training step, for every single sample. So they have to come up with a different way.

What they do is turn this entire thing into a dataset. They take the supervised model, produce a whole bunch of summaries, and always ask the humans which one is better. A sample from this new dataset consists of a big text, two summaries of that text (it doesn't really matter how they were generated), and a label saying which one is better. That's the x, the label is the y, and to this dataset they now fit a model — a model that simulates the human; the model learns from the human. In reinforcement learning this is closely related to imitation learning and to reward model learning — there are a bunch of names for it. It's not exactly imitation learning, because there you'd have actual samples of the policy, so let's stick with "reward model learning" so that I'm correct.

The exact way you do this is that you don't directly fit the x to the y. What they train is a reward model: it takes in a piece of text and one summary, and predicts a number that is supposed to say how good that summary is for that given document. But the humans never gave us that number — we only know whether a summary is better or worse than some other one — so we can't use it directly as a label. Instead, one post with two summaries judged by a human is fed through the same reward model: it scores summary j, it scores summary k, and the loss is built from which one the human preferred.

The loss is pretty simple: you subtract the two rewards from each other, put the difference through a sigmoid non-linearity, and take the log, because the loss is in log space. What the sigmoid does: if summary j is better than summary k, the difference is positive and the sigmoid maps it towards 1; if k is better, the sigmoid maps it towards 0. So you map the reward difference to a 0 or a 1, and that's exactly what your label is — either 0 or 1, depending on which summary the human preferred. That's a sensible loss you can regress on. So now you have a dataset and a model you can train on it, namely the reward model. And you can iterate this: you can go back and do it all over again, and I think they do — improving their summaries, asking the humans again, training the reward model again.

The last part: now you have a reward model. Remember, we said it was too expensive for humans to always answer "which one do you prefer" — well, now we have a model that can substitute for the human. So we can use reinforcement learning to train the summarization model to maximize the reward. We give the model a piece of text, and it produces a summary — these are exactly the same kind of models as before; in fact, they start from the supervised baseline and plug it in here as the policy that produces the summary — and they fine-tune it using reinforcement learning. PPO, proximal policy optimization, is a pretty simple but very effective reinforcement learning technique. What you need is an input (your x), an action (the model's output summary), and a reward. For the reward, you take the reward model, which at this point is fixed: you've learned it, and now, for each summary, it can tell you how good that summary is. The reinforcement learning then simply tries to generate summaries that make the reward model as happy as possible — and the reward model is learned from the humans. So at the end, through the proxy of the reward model, we are training directly for human enjoyment. We are not training for log-likelihood, as in the supervised baseline; we are not training for ROUGE, which we could do with reinforcement learning but which is itself a pretty bad metric. We are training directly for what humans say they prefer — at least as far as the reward model can approximate the human preferences.

So you can see this is potentially a good approach. Now, if you read about this on, let's say, Twitter or elsewhere, people are very joyous: wow, we are aligning models with human interests, with human preferences, human in the loop.
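The pairwise reward-model loss described above — score both summaries with the same model, subtract, sigmoid, log — can be sketched as follows. `toy_reward` is a hypothetical stand-in scorer; the real reward model is a GPT-style network with a scalar output head.

```python
import math

def toy_reward(post, summary):
    # Hypothetical stand-in: score by word overlap with the post.
    # The real r(x, y) is a learned 6.7B-parameter model.
    return float(len(set(post.split()) & set(summary.split())))

def pairwise_loss(post, preferred, rejected, reward_fn=toy_reward):
    # loss = -log sigmoid(r(x, y_preferred) - r(x, y_rejected)):
    # near zero when the preferred summary scores much higher,
    # large when the reward model gets the ordering wrong.
    r_j = reward_fn(post, preferred)
    r_k = reward_fn(post, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_j - r_k))))
```

When the two scores are equal, the loss is log 2 ≈ 0.693; gradient descent on this loss pushes the human-preferred summary's reward above the other one's, which is all the binary comparison labels can tell us.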

The Unknown Effect of Better Data

It's still difficult, though. I think this is slightly overhyped in that direction — the "wow, these are such good things" direction — for a couple of reasons. First of all, this costs a lot of money: you need to work closely together with these humans. And — I don't remember where exactly they say it — they did not compare to a model trained on equally expensive supervised data. In the supervised setting you have your dataset of texts and reference summaries; well, no one knows what happens if you invest as much time, money and effort into collecting a bigger dataset of plain reference summaries and then training a supervised model on that. Nobody knows. They admit this in the paper — they say it was too expensive to also run that control. But chances are that models would improve significantly as well if you simply provided a bigger dataset of those. So it's questionable whether the modeling of the reward is really the deal-breaker here, or simply the fact that they collected much more, much higher-quality data to train on, with the reward model merely serving as a proxy for that data. That's the first dent: it's not really clear.

Don't get me wrong, though — this paper is pretty awesome, especially because they evaluate all the summaries using humans as well, and that costs a lot too. Regardless of training, even evaluating these summaries with actual human feedback rather than ROUGE is very expensive, and they do it, which gives you the most accurate signal; that alone is commendable. But I don't yet believe that this reward modeling is

KL Constraint & Connection to Adversarial Examples

the thing that made the improvement here in their training procedure the second thing is they do the following their reward for the ppo algorithm isn't actually just the reward from the reward model as you can see here but it has this kl term in here so what does this kl term do so here is the this is the supervised baseline is simply a model that as we said was trained to input it post and output one of the summaries that the humans provided this thing right here is the reinforcement learned baseline so this is the thing that's actively changing during ppo okay so and you constrain this to be to stay close to the um to the supervised baseline so you don't want your reinforcement learned model to go far away from the supervised baseline model so in terms of the reward your reward is going to be um the reward that you get from the reward model that is trying to predict how good humans like the particular thing uh minus a penalty so mine is a penalty term if you are too far away from the supervised baseline and this should remind you of something so you're kind of trying to optimize the you're trying to especially if you look at the diagram of the model right because you have a piece of text right and then you have your model right here that you train and then you have the output summary okay reward model and as an output that you're trying to make as big as possible now what does that remind you of if you look at this model right here you're trying to optimize its input right this is the input to that model in order to make its output a certain way while all the while making the input be not too far away from some reference input this should remind you of adversarial examples all right because what's happening right here is exactly we are trying to find an adversarial example to the reward model okay it's not adversarial in the sense that it tries to maximize its loss or something like this but it is trying to maximize its output its reward and it's trying to 
manipulate the input to the reward model such that the reward is as high as possible and what do we know about adversarial examples um is that they aren't really part of the normal data spectrum if you will so and we're going to see this and they have this this problem as well so if they uh constrain they there is a parameter there where you can trade off how close you want to stay so how much freedom do you give the reinforcement learning to go away from the supervised baseline and you can clearly see that here is the fraction preferred by humans and here is this kl this kl if you optimize with reinforcement learning and you let the reinforcement learning you know you give it some room the more to the right here the more freedom the reinforcement learning model has you can see that it goes up and up but after a certain while it is flat and actually goes down again so if you purely reinforcement learn what you really find are adversarial examples to the reward model that have nothing to do with the humans anymore because it's really just an adversarial example and to demonstrate this they have this nice piece in the appendix where they give samples from these over-optimized policies so policies that are just over-optimized to this reward model so here and we don't see the piece of text which i find is also interesting because here we are just um the reader of the paper can is just tasked with judging without i think without finding the piece of text without reading the piece of text which is interesting that humans can actually do this makes you kind of think of how it all works but so here the reference summary that a human wrote was i'm 28 may live in san jose i would like to learn how to do gymnastics okay um 20 or year old dudes stubbornly post ponies start pursuing gymnastics hobby citing logistics reason despite obvious interest question or question my question mark um it so yeah negatively affecting long-term fitness progress personally it just seems like a 
bunch of these websites that people made to rank high on Google because they contain all the terms that make Google happy. Something like this is exactly what's happening here: you're just trying to fit everything in there to make the reward model happy. The reward model was only ever trained on, let's say, coherent textual summaries, so if you go away from that data manifold, you can find things that score high but that a human wouldn't rate highly. That's simply because the reward model isn't all-knowing; it's just a neural network, and neural networks are susceptible to adversarial examples. "Left password saved on work computer. Replacement spends every hour of the day watching Netflix. Employees stubbornly postponing, his replacement, negatively affecting productivity." You can already see that there is some sort of a pattern here. This policy simply finds a structure of text, "stubbornly postponing" and so on, that seems to make the reward model very, very happy, but it really goes away from the actual text. It's pretty cool, actually, because you can see that it kind of copies over the important words from the post and combines them with what it already knows. It makes sense, and I think this ties a lot into what I've been saying about how GPT-3 works, because this is kind of a really dumbed-down version of GPT-3 (it's actually the same architecture), and you can pretty clearly see that what it does is interpolate different things. In this case it interpolates what it knows makes the reward model happy, which seems to be this phrase right here, and it interpolates the important words from the text on the left a little bit. So it sort of understands what makes the reward model happy, and thereby you can already see how a reward model like this may work: it will judge whether or not some of the words are present. And that's, I think, 100 percent due to the reward model not being trained on sentences like what we've just seen, because even for the supervised baseline the summaries are going to be pretty okay, and the human reference summaries especially are going to be pretty okay; for the most part they're already going to be coherent, linguistically correct, grammatically correct and so on. The reward model has just never seen that space of data. If we scroll back through this giant mess right here, that's already basically the rest of the paper. After implementing this particular constraint, you can see that they now have a handle on how much the RL policy is allowed to go away from the supervised baseline; if they simply constrain this to some reasonable degree, then the reinforcement learning does seem to improve the summaries.
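The handle they use is a per-token KL penalty subtracted from the reward model's score, penalizing the policy for assigning text much higher likelihood than the supervised baseline does. Here is a minimal sketch in Python; the coefficient `beta` and all the scores below are made-up illustration values, not the paper's:

```python
def penalized_reward(rm_score, logprob_policy, logprob_sft, beta=0.05):
    """RL reward = reward model score minus a KL penalty that keeps the
    policy close to the supervised (SFT) baseline. All numbers here are
    illustrative, not taken from the paper."""
    # Monte Carlo KL estimate for a sampled summary y:
    # log pi(y|x) - log pi_SFT(y|x)
    kl_estimate = logprob_policy - logprob_sft
    return rm_score - beta * kl_estimate

# A summary the SFT model also finds plausible keeps most of its score:
on_manifold = penalized_reward(2.0, logprob_policy=-10.0, logprob_sft=-10.5)
# A reward-hacking summary that the SFT model finds very unlikely earns
# a higher raw reward but a lower penalized reward:
off_manifold = penalized_reward(3.0, logprob_policy=-5.0, logprob_sft=-40.0)
```

With the penalty in place, drifting off the manifold of coherent summaries costs more than the extra reward-model score it buys, which is exactly the constraint discussed above.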

More Results

The results here you've already seen, I think, the main ones, and they're pretty good. You can especially see this where they also ask humans to rate summaries along different axes: the reference summaries are always, or most of the time, better than the supervised baseline and also the pretrained-only models, yet the human-feedback models outperform the reference summaries. That's pretty cool, because you'd think humans would be very good at this stuff, but you can think of the human feedback as kind of emulating an ensemble of humans: the reference summary is just a single human writing a summary, whereas the human feedback optimizes a model that tries to integrate all of the human judgments that exist for a particular post. Of course it would be interesting to see how diverse the summaries are; I believe they have some experiment where they sample with different temperatures, but still, maybe there's a trade-off with diversity here, in that the model always goes for the single best one. They do a lot of experiments that I don't want to get into. They also transfer this to a news dataset: the model is trained purely on Reddit but then transferred to the news dataset, and it works pretty well, as you can see right here, almost as well as a supervised baseline that was directly trained on that dataset, which is fairly cool. So I definitely think there is value here, the criticism of ROUGE is definitely warranted, and so is the question of how we train on tasks such as summarization where we can't even really formulate what we want; there's a trade-off with length as well. The incorporation of human feedback is very
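For reference, the temperature experiments mentioned above work by rescaling the logits before sampling: lower temperature concentrates probability on the top candidate, trading diversity for the model's single best guess. A small self-contained sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature.
    Low temperature sharpens the distribution (near-greedy, low diversity);
    high temperature flattens it (more diverse output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over three candidate tokens:
sharp = softmax_with_temperature([2.0, 1.0, 0.5], 0.3)  # near-greedy
flat = softmax_with_temperature([2.0, 1.0, 0.5], 2.0)   # more diverse
```

This is the knob behind the diversity trade-off: at low temperature the policy nearly always emits its single highest-scoring summary.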

Understanding the Reward Model

valuable. The last part they do is understanding the reward model: they ask themselves, what does the reward model actually learn? This is where I'm a little bit disappointed, though what's here is very valuable: they show that if you let it go too far, if you optimize only for the reward model, you fail. They also do investigations into model size, how much data you need, and so on. They vary a few things, and this part is pretty cool, where they say: we construct an additional validation set by having labelers make minimal edits to summaries to improve them; our reward models prefer the edited summaries almost as often as a separate set of human evaluators. So the reward models can sort of spot when summaries improve, and they do a lot of validating that the reward models are actually in line with human preferences. However, as we've seen, if you directly optimize for the reward model, if you are allowed to go away from the data manifold of valid summaries, then anything can happen, and that's the danger of incorporating reinforcement learning right here. You can also see that the reward models are clearly better than humans as judges: these are the curves that I drew at the beginning for the reward models, whereas ROUGE, as you can see, just flattens out after a certain complexity. What they don't investigate, and what I would find really interesting, is how much the reward model actually depends on the input post, because it seems like you could trade off information from the input post against coherence and so on. What happens if you actually change the input post? Does it matter? How much? That would be fairly cool to look at, especially given that we humans can apparently judge these summaries fairly well by just looking at them, even though of course we have no clue what the article said.
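As background for that minimal-edits validation: the reward model is trained on pairwise human comparisons, and a standard way to do that (this is a simplified sketch, not the paper's exact code) is a logistic loss on the score difference between the preferred and the rejected summary:

```python
import math

def comparison_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the human-preferred summary wins,
    with win probability sigmoid(score_preferred - score_rejected).
    A simplified sketch of pairwise preference training."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Ranking the preferred summary clearly higher gives near-zero loss;
# ranking it lower gives a large loss, pushing the scores apart.
confident_correct = comparison_loss(4.0, 1.0)
confident_wrong = comparison_loss(1.0, 4.0)
```

A model trained this way only learns an ordering over the kinds of summaries it has seen, which is why minimally edited summaries make a good sanity check and why far-off-manifold text is not covered.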

Limitations & Broader Impact

All right, so here they discuss some limitations, and they're of course very open about them: producing good human feedback is extremely skill-intensive, time-consuming and expensive. The last thing here is the broader impact statement, and they of course go through the full trifecta of broader impact statements, which, again, to repeat: here is you, and you take your hand and, like the Catholics do, you touch here, you touch here and the shoulders here, and you say the magic words: technology good, technology bad, technology biased. Now, "technology" here is a stand-in, because broader impact statements never actually deal with the exact method in the paper; they always go up one layer or two, and the extreme is "technology" itself. You don't want to talk badly about your own technique, because, my god, your technique isn't bad, is it? So you just go up a level and say whatever: language models can be bad or good, or machine learning, or technology. First the good part: "many potential positive effects of aligning machine learning algorithms with the designer's preferences." Again, I think this "aligning" is a bit overhyped, because we clearly see that, the way they do it, if you align too much, it ironically becomes misaligned again. Then the bad part: "unfortunately, our techniques also enable malicious actors to more easily train models that cause societal harm." That's the technology-bad part, and you can see, for instance: "one could use human feedback to fine-tune a language model to be more persuasive and manipulate humans' beliefs." Note that we're talking about language models here, not about summarization in this particular case. And then technology biased: you can pretty clearly predict that there's going to be a part that is something like, there you
go: "however, since the dataset consists of user-submitted posts with minimal moderation, they often contain content that is offensive or reflects harmful societal biases. This means our models can generate biased or offensive summaries, as they have been trained to summarize such content." At least this one is actually about summarization, the model in question right here, so props to that. But if you ever write a broader impact statement, the holy trifecta of broader impact statements must apply, and you're good. All right, those were my thoughts on this paper, a bit of rambling. Look at the paper, look at the appendix, look at the code that they've released; I believe they've even released the small model, they have a one-billion-parameter model, but I don't want to promise too much. They have a lot of appendices, a lot of experiments in there, so check out OpenAI's release. With that, that was it for me. Bye-bye.
