ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)
33:25


Yannic Kilcher · 01.05.2024 · 25,729 views · 664 likes


Video description
Paper: https://arxiv.org/abs/2403.07691 Abstract: While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval2.0 (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B). 
Authors: Jiwoo Hong, Noah Lee, James Thorne Links: Homepage: https://ykilcher.com Merch: https://ykilcher.com/merch YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://ykilcher.com/discord LinkedIn: https://www.linkedin.com/in/ykilcher If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannickilcher Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (7 segments)

Segment 1 (00:00 - 05:00)

Hello there. Today we're looking at "ORPO: Monolithic Preference Optimization without Reference Model", a paper by researchers of KAIST AI. It largely deals with alignment: aligning language models and instruction-tuned models in some way; we'll dive into what that means. I also know there is a paper by Meta released today that deals with another preference optimization topic, so while this isn't that, it should be at least fairly related. So what do we mean by alignment? It can have many different meanings, especially nowadays; it can mean basically whatever you want, but usually things go something like this. You have a language model, and that language model is just a next-token predictor: it just predicts next tokens. But what people want is some sort of assistant model, something that will take an instruction and follow that instruction. In the first instance you do that with a technique called supervised fine-tuning (SFT), which essentially means that you do have demonstrated, or some sort of labeled, data, so it's supervised learning. Whereas in language modeling you really just have text that you scrape from websites, in supervised fine-tuning you will have something like: okay, there is an instruction, and then there is an appropriate answer, and maybe it's multi-turn, so you have the next instruction and the next appropriate answer. You can either approximate this with a proxy, like scraping certain Reddit or Quora forums, or you can actually collect it from humans, like OpenAI has done for their initial InstructGPT papers or we have done for Open Assistant, or you can generate it synthetically; there are all kinds of ways to collect that data. But what we always have is some sort of an input x, which would be the instruction in this case, and some sort of an output y. So you do have demonstrated, labeled data for that step. However, there is a third step that is sometimes
involved here, which in the classic sense is the kind of RLHF method; nowadays there are many more methods for that. What you end up with after supervised fine-tuning is a model, so here is my model, there's a box: I can input stuff and it will give me some sort of an output. Now, alignment refers to making these outputs more in line with what we expect, and notably, making the undesired outputs at this stage less likely. So what's going to happen is that we input a whole bunch of things, maybe even the same things multiple times, and we get out a whole bunch of outputs: ŷ1, ŷ2, ŷ3, and so on. Maybe now we have some way of assessing which one is better and which one is worse, or which one is good and which one is bad. So maybe humans rank them in order of preference, or, as usually happens, we train a reward model that does it for us. The reward model will say: okay, this one is very good, this one is good, and this one here, that's bad. The process of alignment refers to the fact that even though we have a model that is quite good at language, and even though we've actually trained it to follow instructions, we can still improve upon that by making answers to instructions that we agree with in some way go up in likelihood and answers we disapprove of go down in likelihood. That doesn't need to be agreement in an ideological sense, just answers that we like; I deliberately don't tie the preference rating to anything in particular, you can tie it to quality, ideology, whatever criteria you want. This is notably different from supervised fine-tuning, which in a supervised way goes from

Segment 2 (05:00 - 10:00)

an instruction to an answer to that instruction. So it's a different kind of data: here you have input-output pairs, and here you have different outputs and you want to make some more likely and some less likely. This process is largely what's referred to as alignment, at least in this paper and in the topic we talk about today. What this paper does is say: look, these steps always need this supervised fine-tuned reference model, they may even need a reward model in between, or maybe not, but it's always a multi-step process. You do the supervised fine-tuning and then you do the alignment procedure, and so on. They are going to suggest a method that encompasses not just SFT and RLHF separately, but the whole SFT-plus-alignment in one procedure, and that is ORPO. How do they do that? Mainly by looking at SFT, analyzing it, and coming up with a hypothesis for why supervised fine-tuning can't just tackle the same data. Say we always have pairs: we have x, and we have y_w and y_l, as they're going to call them. So we have a prompt, and then we have winning responses, responses that are approved of, and losing responses, which are responses that are disapproved of, and the goal is to make the winning ones more likely and the losing ones less likely. Why doesn't supervised fine-tuning work for that in some way? They analyze that, and then they come up with a method to integrate it into supervised fine-tuning so that you don't need two steps anymore. That's essentially it. And the results, while not ginormous gains as far as I can assess them just from the paper (I haven't tried the method myself), do seem to be consistent and do seem to make a notable difference. So yeah, it could be an option,
and also, notably, you don't need this multi-step thing anymore, you don't need in-between models, and therefore you may even save on compute a little bit while getting better. Okay, so you essentially see the same thing right here: on the left-hand side, with RLHF and direct preference optimization (DPO), you do have this multi-step process, and with ORPO you do not, because essentially they just build one loss from the two steps. They build a joint loss and then optimize that loss in one go. Maybe it's actually worth it to jump straight to the ultimate loss they're going to implement, because I think working backwards from there is a bit easier than working forward. So, working backwards, here is the loss they propose. This is a loss that you would use during a step that is equivalent to the supervised fine-tuning step, except that you not only need (x, y) pairs, you actually need preference data, so winning outputs and losing outputs, because to do the alignment you need to know what you should align towards and what you should align away from. What they end up doing is taking the regular supervised fine-tuning loss, which we'll look at in just a second, and mixing it with this OR loss, where OR stands for odds ratio, with a hyperparameter trading the two off. The SFT loss is largely responsible for actually creating an instruction-following model (remember, this step starts with a plain language model, so you want to make it instruction-following), and the OR loss is going to make it such that the
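As an aside before the next segment: the combined objective just described, the SFT loss plus a weighted odds-ratio term, can be sketched in a few lines. This is a minimal illustration with scalar stand-ins (the length-normalized sequence log-likelihoods and the SFT negative log-likelihood are assumed given, and the values below are made up), not the authors' implementation.

```python
import math

def log_odds(avg_logp):
    # odds(y|x) = p / (1 - p), computed from a length-normalized log-likelihood
    p = math.exp(avg_logp)
    return math.log(p) - math.log(1.0 - p)

def orpo_loss(avg_logp_w, avg_logp_l, sft_nll, lam=0.1):
    """Joint loss: SFT negative log-likelihood plus lambda * odds-ratio term."""
    # log of the odds ratio = difference of the two log odds
    log_or = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    # L_OR = -log sigmoid(log odds ratio): small when the winner's odds dominate
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_or)))
    return sft_nll + lam * l_or
```

When the winning response is already more likely than the losing one, the odds-ratio term shrinks toward zero and the loss is dominated by the SFT part; when the ranking is inverted, the term grows and pushes the model to reorder them.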

Segment 3 (10:00 - 15:00)

instruction following happens in a way that you approve of and doesn't happen in a way you disapprove of. As I said, the data here is going to be some sort of distribution of prompts, winning responses, and losing responses, given some method of collecting these, be that humans or synthetic or proxy or heuristic. So, the loss of supervised fine-tuning: it is essentially a log-likelihood loss. What you want to do is input your prompt and make the output you want likely. You essentially say: okay, given this prompt, I want this answer to come out. Internally, language models obviously predict token by token, which is why this gets split up over all the tokens: each token is predicted from the prompt and from the previous tokens, and you make the log-likelihood of that token, given that input, as high as possible. What does that mean in practice? You have your text so far, that's your x, then your y so far, and now you need to produce the next token. So you take all of this and shove it through a neural network; typically this is going to be some sort of autoregressive, causal-attention Transformer, but it doesn't need to be, though nowadays that's quite a safe bet. Out come these logits, which you normalize, and you get some sort of distribution over your vocabulary: maybe this is "cat", this is "dog", this is "house", this is "ears", and so on. So each token that could be produced is assigned a
probability, if you will, and you select according to this distribution (there's going to be a temperature parameter and whatnot). What this log-likelihood loss does is essentially say: there is the correct token, correct according to the training data, which is maybe going to be "dog" right here; I want to make this one higher and everything else lower. Every single other token goes lower: "dog" goes higher, everything else goes lower. With this loss, can you imagine why models don't automatically get aligned? You can obviously only train on the winning responses, but you might think: well, if I train on the winning responses, the model should learn that, so why don't models become aligned, why don't they automatically make the losing responses, which are different from the winning ones, less likely at the same time? Here you can see why: the loss makes everything else less likely, and since our vocabulary sizes are, what, 32,000 tokens by now or something like this, there's going to be one token that's correct and all the others are pushed down indiscriminately. Let's say y_l and y_w actually start with the same few tokens but then differ: the prompt is "how are you doing", the one you want to align with is "I am doing just fine", and the one you don't want to align with is "I am terrible today", because you want to train some sort of ever-friendly chatbot. Both start with "I am", and then there's one correct token and one token you surely don't want. But this loss does not differentiate the one you surely don't want from all the others that are also just wrong, if you just look at the winning answer. And they hypothesize, or they propose,
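To make the token-level picture concrete, here is a tiny sketch of the log-likelihood loss for a single next-token prediction, with made-up logits over a toy four-word vocabulary (a hypothetical illustration, not the authors' code):

```python
import math

def next_token_nll(logits, target_index):
    # softmax over the vocabulary, then negative log-likelihood of the target
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_index])

# toy vocabulary: ["cat", "dog", "house", "ears"]; "dog" (index 1) is correct
loss = next_token_nll([0.2, 2.0, 0.1, -0.5], target_index=1)
```

Lowering this loss raises the target token's probability, and because the softmax normalizes over the whole vocabulary, every other token, the "surely wrong" one and the merely irrelevant ones alike, gets pushed down indiscriminately, which is exactly the failure mode discussed above.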

Segment 4 (15:00 - 20:00)

that due to this (and they're not the first ones to recognize this flaw in the log-likelihood loss), alignment essentially doesn't happen automatically. What is probably happening during this procedure is that most of the effect of pushing down all these other tokens is just that the model learns structure and grammar. Grammar in itself it should already know from language modeling, but it learns the structure and grammar of the type of outputs that you want: I give you an instruction, and you give me some sort of fulfillment of that instruction. You narrow down the scope of language considerably once you do that, and all of this training largely just turns the language model into a general instruction-following model. Which means that even though the losing answer is obviously also among all the ones that get pushed down, in effect, because the losing answer is probably more similar to the winning answer than random output (all of the others are essentially random output), supervised fine-tuning will in fact also increase the likelihood of the things you don't want to align with. And that is one of the experiments right here: you can see the log probabilities for chosen and rejected responses during fine-tuning on an RLHF dataset, and despite only chosen responses being used for supervision, rejected responses show a comparable likelihood of generation. So even the rejected responses increase in likelihood and do not lag far behind the chosen responses. This failure of supervised fine-tuning to distinguish between pushing up the correct one and pushing down all the others, versus specifically pushing down the one that you don't want to align with,
that piece, they say, is missing. So they introduce that piece in the form of an auxiliary loss (not regularization, just an auxiliary loss) that you can add on top of supervised fine-tuning, and that incorporates the desired behavior. Pretty straightforward, but it does mean you don't need this sequencing of training anymore. The thing they propose is based on these odds. The odds of y given x is the likelihood divided by one minus the likelihood: the odds tell you how much more likely the event is to happen than to not happen. If the odds are four, it's four times more likely that the event happens than that it doesn't, which is an 80/20 split. From that they define the odds ratio, which is simply the ratio of the odds between two things, so you need two possible outputs given the same prompt. I guess this should really say "given x" in both places; you can generalize it and say the odds ratio between y_w given x and y_l given x, but you don't need to if you always consider the same prompt. The odds ratio essentially says how much larger the odds are for the winning output versus the losing output. That doesn't go directly into the loss, but we're going to do the following:
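The odds and odds-ratio definitions above are simple enough to state as a two-line helper (a hypothetical illustration; the probabilities are made up):

```python
def odds(p):
    # odds(y|x) = P(y|x) / (1 - P(y|x)): how much likelier to happen than not
    return p / (1.0 - p)

def odds_ratio(p_w, p_l):
    # how much larger the odds of the winning output are vs. the losing one
    return odds(p_w) / odds(p_l)

# an 80%-likely event has odds of about 4, the 80/20 split from the video
print(odds(0.8))  # ≈ 4.0
```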

Segment 5 (20:00 - 25:00)

we're going to take the log of the odds ratio, which comes down to just the subtraction of the log odds of each of them, and we're going to put this through a nonlinearity and again take the logarithm of that. There are a lot of logs involved here, but essentially the first log transforms stuff into log space; the sigmoid nonlinearity kind of transforms it back into probability space, because its output is between zero and one; and then the outer log pushes it back into logit space. I don't know, that's how I imagine it. And then there's a minus sign: essentially they want this part to go up, and through the minus sign it becomes a loss. So they just want the stuff you want to align with to be more likely than the stuff you don't want to align with, given the same prompt. They do a good investigation: they actually calculate the gradient of this part of the loss, and that can give you some insight into what actually happens during training, because you train with gradient descent. Once we see the gradient, we can know how the situation we've demonstrated down here will change if we add that piece of the loss. They say the gradient (and they have a full derivation of this) consists of two different terms: this δ term and this h term. The δ term is the following fraction; notice the inverse right here, so this is going to be one over this, and there is a one-plus right here, which means that if the ratio inside is really big, the whole thing becomes really small. And that means that if the odds of the
winning side are really big compared to the odds of the losing side, then this has essentially no effect; if the odds of the losing side are greater than the odds of the winning side, then this does in fact have an effect. So the whole term is small when we are already correct and bigger when we are not yet aligned. When we're already aligned, when the winning output is very likely, this goes towards zero, and through the multiplication the whole term has essentially no effect, which is what we want: if we're already aligned, good. However, if we're not yet aligned, the ratio will be small, the inverse of one plus something near zero is near one, and so the whole thing goes towards one, and then this has an effect. So δ is something between zero and one. Now, if it's one, what is the term that it multiplies? You can see here that it's one gradient minus another gradient, so this term is largely going to push into the direction of one gradient and away from the direction of the other, and that's exactly what we want: it pushes into the direction of the winning answer and away from the losing answer. Both terms are weighted by their own unlikelihood, in a way: the gradient of the winning answer is divided by one minus the likelihood of the winning answer, so if the winning answer is already quite likely, it's going to push even more into that direction, and if the losing answer is already quite likely, it's going to push it away more from
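Schematically, the two-part gradient being described can be written out as follows. This is reconstructed from the discussion, following the paper's derivation; readers should check the exact form against the paper itself.

```latex
% gradient of the odds-ratio part of the loss: a gating factor times a contrast
\nabla_\theta \mathcal{L}_{OR} \;=\; \delta(d)\cdot h(d),
\qquad
\delta(d) \;=\; \left[\,1 + \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\,\right]^{-1},
\qquad
h(d) \;=\; \frac{\nabla_\theta \log P_\theta(y_w \mid x)}{1 - P_\theta(y_w \mid x)}
      \;-\; \frac{\nabla_\theta \log P_\theta(y_l \mid x)}{1 - P_\theta(y_l \mid x)} .
```

Here δ(d) lies between zero and one: it vanishes when the winner's odds already dominate and approaches one otherwise, gating the contrast h(d) that pushes toward y_w and away from y_l.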

Segment 6 (25:00 - 30:00)

that direction. Now, I personally am not super sure this effect makes too much sense. Remember, the fraction up here only really makes the term active when we're not yet well aligned, so it could be that both answers are actually very likely, just that the losing answer is a lot more likely. In that case this part would pull hard, but this part would also pull hard. So if the model is just really sure about everything, the gradient is going to be bigger in both directions, and if the model is not so sure about something, I would guess it's not so big; I'm not sure if that makes intuitive sense or not. Everything up until here I can get on board with, except the things under the fraction; maybe there is a better way of scaling that, but it is what it is in this method. They also discuss this. They say equation 10 implies a weighted contrast of the two gradients from the chosen and rejected responses; that's very good. Specifically, "1 − p in the denominators amplifies the gradients when the corresponding side of the likelihood is low". Really? "For the chosen responses, this accelerates the model's adaptation toward the distribution of chosen responses as the likelihood increases." Yes, that I can get on board with: it accelerates the adaptation toward the distribution of chosen responses as the likelihood increases. But "amplifies the gradients when the corresponding side of the likelihood is low": isn't this exactly the other way around? If the likelihood is low, then 1 − p is high, so the thing under the fraction is high, so the whole fraction is low, which means it does not amplify the gradient. One minus p
amplifies the gradient when the corresponding side of the likelihood is low? Maybe I'm super confused, or I read something wrong, or this sentence got muddled; the second part down here I completely agree with. Make your own conclusions. In terms of the results, they do quite a bit of investigation into different aspects of this, including some auxiliary experiments, the diversity of outputs, and things like this. You can see from the results that, as I said, sometimes it's not a giant improvement, but it does seem to be a consistent, good improvement, largely across the board, I think. You can also see that it does well in comparison to models that are larger than itself: this is an ORPO alignment of a Mistral 7B model, and you can see that it compares favorably, or is at least on par, with things such as Llama 2 models ten times larger. All right, I want to leave it at that; I think this concludes the overview. They do go into other things. Oh, this I found interesting, where they compare it to the probability ratio. You can ask: why go through all of this trouble with the odds and the odds ratio and so on, couldn't you just make a loss term that directly maximizes the probability of the winning stuff and minimizes the losing stuff? If you take the log of that probability ratio, it directly becomes the log of P_w

Segment 7 (30:00 - 33:00)

minus the log of P_l. So intuitively one would say: well, the first part is already the SFT loss, can we just subtract the opposite term for the losing output? And it works in sort of a directional sense, but they also show that for the same pairs of likelihoods, the odds ratio gives a much more balanced and flat distribution of values, whereas these probability ratios tend to get extremely spiky: as soon as one probability is a bit lower than the other, the ratio is already quite out of whack. Am I doing this correctly? Sorry, the other way around: it's the probability ratio that is numerically not advantageous. The odds ratio is quite nicely distributed, meaning that for a given difference in likelihoods you have a wider range of values to work with, whereas with probability ratios you're stuck with a narrower range of values that the ratio can take; you can see the distribution right here being super peaky. I guess this is a lesson in how you can frame things, because before, I would have said how advantageous it is to not have too wide a range, since a little change makes a big change in the ratio; but now that I know they prefer the opposite, I'll say: oh, it's really good that it has a big range. In any case, I do believe them when they say the odds ratio is numerically preferable here. They also show that during training with this additional loss, this quantity itself goes up nicely, and, remember the graph from before where the rejected responses kept rising: now the rejected responses actually do get rejected. Obviously the big question is the trade-off: what do we lose by adding this loss? There's always a trade-off, and the trade-off
here might be that the instruction-following capabilities themselves are not as pronounced in these models as they would be if we were to just train with the SFT loss. However, the numbers on the different benchmarks do look okay, in my opinion, and therefore, at least to humans, maybe that doesn't really matter that much. All right, good, that was it for me. Thank you very much for listening and watching, please stay hydrated, and have a wonderful rest of the week. Bye-bye.
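As a closing sketch of the numerical point from the last segment: for responses the model is already fairly sure about, the log probability ratio compresses toward zero while the log odds ratio keeps a wide range of values. This is a toy illustration with made-up probabilities, not the paper's experiment.

```python
import math

def log_prob_ratio(p_w, p_l):
    # log of the direct probability ratio P(y_w|x) / P(y_l|x)
    return math.log(p_w) - math.log(p_l)

def log_odds_ratio(p_w, p_l):
    # log of the odds ratio, with odds(p) = p / (1 - p)
    log_odds = lambda p: math.log(p) - math.log(1.0 - p)
    return log_odds(p_w) - log_odds(p_l)

# as both likelihoods grow, the probability ratio saturates,
# while the odds ratio still clearly separates winner from loser
for p_w, p_l in [(0.6, 0.4), (0.9, 0.8), (0.99, 0.9)]:
    print(f"{p_w:.2f} vs {p_l:.2f}: "
          f"log-prob-ratio {log_prob_ratio(p_w, p_l):.3f}, "
          f"log-odds-ratio {log_odds_ratio(p_w, p_l):.3f}")
```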
