Gradient Surgery for Multi-Task Learning
Duration: 32:16


Yannic Kilcher · 21.04.2020 · 9,058 views · 278 likes


Video description
Multi-Task Learning can be very challenging when gradients of different tasks are of severely different magnitudes or point into conflicting directions. PCGrad eliminates this problem by projecting conflicting gradients while still retaining optimality guarantees. https://arxiv.org/abs/2001.06782 Abstract: While deep learning and deep reinforcement learning (RL) systems have demonstrated impressive results in domains such as image classification, game playing, and robotic control, data efficiency remains a major challenge. Multi-task learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. However, the multi-task setting presents a number of optimization challenges, making it difficult to realize large efficiency gains compared to learning tasks independently. The reasons why multi-task learning is so challenging compared to single-task learning are not fully understood. In this work, we identify a set of three conditions of the multi-task optimization landscape that cause detrimental gradient interference, and develop a simple yet general approach for avoiding such interference between task gradients. We propose a form of gradient surgery that projects a task's gradient onto the normal plane of the gradient of any other task that has a conflicting gradient. On a series of challenging multi-task supervised and multi-task RL problems, this approach leads to substantial gains in efficiency and performance. Further, it is model-agnostic and can be combined with previously-proposed multi-task architectures for enhanced performance. Authors: Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn Links: YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher BitChute: https://www.bitchute.com/channel/yannic-kilcher Minds: https://www.minds.com/ykilcher

Contents (8 segments)

Introduction

Hi there! Today we're looking at Gradient Surgery for Multi-Task Learning by Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. The concern of this paper is a thing called multi-task learning. Now, what is multi-task learning?

What is multitask learning

This has some very subtle distinctions from other settings, which I think is why it's important to look at it a bit. Let's say you have a learning problem with multiple tasks. This seems easy enough: we have the same input, but we want to perform two different tasks on it, task 1 and task 2. For example, if the input is a food item, task 1 could be "is it a fruit?" and task 2 could be "how many calories does it have?". Let's actually say the input is a food picture; since Instagram is full of food pictures, we have lots of training data, at least unlabeled, though people usually label it too.

Now, we could train two separate machine learning classifiers: classifier 1 simply does the fruit question, classifier 2 does the calorie question. But since both tasks deal with the same input distribution, it would be nice if we could share a representation. So maybe we have one neural network with many layers, we take the hidden representation at the end, and for each individual task we attach just one or two fully connected layers. Our goal is that the hidden representation is shared.

Why could that help? Because maybe we have lots of training data for "how many calories does it have", a big database, but only a handful of data points for the "is it a fruit" task. Or we might not have much training data for either task. In both cases we might benefit from training this shared representation.

You might have already seen something similar with BERT. In BERT's case the input is text, but what BERT does is different from multi-task learning: first you do masked-language-model pre-training (step 1), and then you fine-tune on a number of tasks (step 2), like question answering, sentiment detection, entailment, and so on. That is called pre-training and fine-tuning. In multi-task learning we actually want to train on the different tasks at the same time, maybe with different data for each, and simply create this shared representation; we hope that by combining the tasks we learn them better than if we learned each task individually. Alright, so this paper says there is a big problem with setups like

Example

this, and they illustrate it in the example right here. Let's say you have a multi-task objective and the loss landscape looks like this. Imagine a neural network with just two weights, θ1 and θ2, and this is what the optimization landscape looks like for task 1. If you're not used to this kind of depiction: the light parts are high values of the loss function and the darker parts are low values, so you want to get to the darker parts.

Usually we discuss pictures like this in terms of optimization, for example with SGD. If you're here, where does the gradient point? The gradient points in the direction of steepest increase, so the negative gradient points down the valley. With too large a step size, SGD overshoots: we take a step and go too far, so the gradient now points back the other way, we step back, and we just keep oscillating. This is a classic problem with SGD, and what we can do is decrease the step size so that we converge, or use something like Adam that adapts the step to the variance of the gradient landscape. So these are problems in optimization.

But what happens when you have a multi-task objective? For task 1 alone, the optimization landscape looks like this, if we just train the two weights θ1 and θ2 that we care about and keep everything else fixed. Task

Loss Function

1 looks like this. But task 2 is a different task, so we need to set the weights differently to get the desired output, and its landscape looks different. Our loss function is going to be a combination: the loss for a given sample is the loss on task 1 of that sample plus the loss on task 2 of that sample. That combination is what you see on the right: this plus this equals this.

Notice that in task 1 it almost didn't matter whether we were at this point or that point, both had a relatively low loss value, but in task 2 this point here is not an optimum. So if you add them, this point still has a low value, but not as low as this much darker one. The landscape for both tasks together looks different from the landscape of either task alone, and your goal is to find the optimal point that works for both tasks.

Now the paper identifies problems with this kind of multi-task learning, and they say the problem is that you can have what are called conflicting gradients. Look at where the gradients point for the different tasks. They care about the point right here; they use Adam in this case, their starting point is here, and the optimizer has come this way so far, stopping a little bit before the valley. Let's analyze the gradients. The gradient of task 1 points down the valley, and it's pretty big because the landscape is pretty steep there: you can see the level curves getting closer and closer together, which means the gradient is large. For task 2, at the same point, the gradient points in a different direction and is not as steep; the level lines are still pretty far apart, which means the landscape is relatively flat there.

This is what the paper calls conflicting gradients, and they're drawn in here; I'll draw them a little larger. These two gradients, first of all, have different magnitudes, one much larger than the other, and the angle between them is also large: they are more than 90 degrees apart, which is what "conflicting" means. If you calculate the resulting average gradient, it looks like this: our algorithm wouldn't actually go down the valley, it would go up the hill again, because the differently sized gradients from the different tasks point in different directions.

Now, an important point. I was wondering for a long time: what's the difference between this and simply having a data set, where the loss on a data set D is just the sum of the losses on the individual data points x_i? It is the same situation in that different data points give you different gradients. Sorry if I'm going a bit fast, but the gradient with respect to your weights of the loss over the entire data set is approximated by the average over a mini-batch, (1/n) Σ_i ∇ L(x_i): your total gradient is the average of the gradients of individual data points, and these might be conflicting as well; one could point in this direction and another in that direction. Yet Adam and SGD handle that just fine, because of this averaging operation. I think what is different in multi-task learning is that the task distribution is not stochastically i.i.d., let's say.
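To make the conflicting-gradient picture concrete, here is a small numeric sketch in NumPy. The gradient values are made-up illustration numbers, not taken from the paper; the projection rule is the one the abstract above describes, projecting a task's gradient onto the normal plane of a conflicting task's gradient:

```python
import numpy as np

def pcgrad_project(g_i, g_j):
    """If g_i conflicts with g_j (negative inner product), remove from
    g_i its component along g_j, i.e. project g_i onto the normal
    plane of g_j; otherwise leave g_i untouched."""
    dot = float(g_i @ g_j)
    if dot < 0:
        g_i = g_i - (dot / float(g_j @ g_j)) * g_j
    return g_i

# Made-up task gradients at the same parameter point: conflicting
# (angle > 90 degrees) and of very different magnitudes.
g1 = np.array([3.0, 0.5])   # steep task-1 gradient, down the valley
g2 = np.array([-0.4, 0.2])  # flatter task-2 gradient

assert g1 @ g2 < 0  # conflicting: the cosine between them is negative

# Plain averaging: because g_avg . g2 < 0, a descent step along
# -g_avg *increases* the task-2 loss to first order.
g_avg = 0.5 * (g1 + g2)
assert g_avg @ g2 < 0

# Gradient surgery: project each gradient onto the other's normal plane.
g1_pc = pcgrad_project(g1.copy(), g2)
g2_pc = pcgrad_project(g2.copy(), g1)

# After surgery, neither projected gradient conflicts with the other
# original task gradient: the inner products are (numerically) zero.
assert abs(g1_pc @ g2) < 1e-12 and abs(g2_pc @ g1) < 1e-12
```

With the projection applied, a step along the combined direction no longer pushes task 2 uphill to first order; the batched, stochastic version over many tasks is what the paper's actual algorithm does.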
So in the data-point case you can always count on the expectation averaging out this noise: if you do mini-batches and aggregate over the whole data set, one gradient might be larger and one might be smaller, but there is no systematic bias coming from the different data points. Here, as we said, one task might be much harder than the other, or you might have much more data for one, or one loss function is just larger in magnitude. You can have any number of systematic biases between the tasks, and therefore the conflicting gradients really are a problem.

This paper does a good job of analyzing the situation of conflicting gradients, and what I find particularly interesting is that they propose an algorithm to deal with them. Whenever two gradients are conflicting, we project them onto each other's normal planes. For example, in step (b) here, we take the gradient of task i and project it onto the normal plane of the gradient of task j. They have a whole algorithm that generalizes this to multiple tasks: you get a mini-batch of tasks, compute the different gradients, which can be stochastic since we have stochastic data sets, go through the batch, and whenever two gradients conflict, project one onto the normal plane of the other. The result is a set of non-conflicting gradients. You might be a bit

Theorems

appalled by this; I was at first when I saw it, but as I said, they actually do a good job of analyzing it. They have two theorems which I find interesting.

Theorem 1 assumes the losses are convex and differentiable, somewhat standard assumptions in optimization. It says that the PCGrad update rule with a step size smaller than 1/L, where L is the Lipschitz constant, will converge either to a location where the cosine between the two task gradients is exactly −1, which essentially never happens unless you construct it, or to the optimal value. So this is basically a consistency theorem: the algorithm still converges to the optimum of the total loss, which is the sum of loss 1 and loss 2 of the two tasks, if you run it long enough. It doesn't say anything about the speed, though; that's where Theorem 2 comes in.

Theorem 2 says: suppose L is differentiable and the gradient of L is Lipschitz continuous, the same assumptions except convexity is no longer needed. Let θ_MT and θ_PCGrad be the parameters after applying one update to θ with g and with the PCGrad-modified g, respectively; so θ_MT would be the original algorithm without their method, and θ_PCGrad with their method. Moreover, assume a bunch of conditions, which we'll go into soon. Then the loss at θ_PCGrad is smaller than or equal to the loss at θ_MT. What does that mean? Think of the loss as measuring how far you are from the optimum in the optimization landscape. It means that as long as the conditions hold, one update step with their method decreases the loss at least as much as one step without it.

So this is a theorem, and they prove it; for it to hold they need three conditions. Let's go from the back. The third one is a condition on the step size: the step size needs to be large enough. The second involves this ε, which is a curvature-bounding measure; it is compared to this constant ℓ, which in turn must be smaller than H, and H up here is the curvature. So it depends on

Conditions

the curvature fulfilling the condition they state down here: the curvature of the multi-task gradient should be large. And the first condition, which we've already seen, is that the cosine of the angle between the gradients needs to be smaller than minus something that depends on the gradients, which turns out to involve the magnitudes of the gradients. So: the step-size condition we can mostly neglect, this one means the gradients should be conflicting, and this one means there should be sufficient curvature in the loss function. This is exactly what we saw at the beginning in the example: there was sufficient curvature because in one direction the gradient was very steep and in the other it wasn't, which basically means the steepness changes from one direction to the other, and the two gradients were conflicting, which we also saw. If all that is the case, then this algorithm brings you to the optimum faster than the normal algorithm, but only

The Tragic Triad

if this is given. Notably, they have a name for this set of conditions, the tragic triad I think. Let me read the conditions as they describe them: first, the angle between the task gradients is not too small, i.e. the two tasks need to conflict sufficiently; second, the difference in magnitude needs to be sufficiently large; third, the curvature of the multi-task gradient should be large; and fourth, the learning rate should be big enough such that large curvature would lead to overestimation of performance improvement on the dominating task and underestimation of performance degradation on the dominated task.

Here you see a little subtlety. I said before that the step-size condition was negligible because you can just set the step size, and in actuality you can; I'm not meaning to rag on this, but think about what that fourth condition means. The learning rate should be big enough that the large curvature leads to this over- and underestimation; basically, their method counts when the step size is large. So, playing devil's advocate: if I have a problem like this, I could either use their method, PCGrad, or I could just decrease my learning rate and use the classic algorithm, because if I decrease the learning rate relative to the curvature, this theorem no longer holds, and it is no longer the case that their algorithm gives me faster convergence.

So there are two ways of looking at this. Yes, under these conditions this algorithm is better, but it is better partly because someone set the learning rate too high, and this algorithm kind of fixes that. The upside is, of course, that usually you don't want to set your learning rate in accordance with the curvature of the problem, which you don't know most of the time; you just set some learning rate, and their algorithm appears to work also when the learning rate is smaller; it's just not guaranteed to outperform the classic algorithm then. I just find this interesting in terms of how you read a paper: when you come across conditions like these, you can always read them as "here is what needs to happen for us to succeed, or for the others to fail, and therefore we're the only ones that succeed in this regime." As I said, it's a cool algorithm, but I found that to be funny.

Alright, so they test this on multi-task benchmarks.

MultiTask Learning

These MT10 and MT50 benchmarks are robotic manipulation suites, and multi-task here doesn't only mean supervised learning: this is actually multi-task reinforcement learning. So you have everything at once, mini-batches, episodes, and multiple tasks, all together; very cool. In their actual implementation, the agent first selects a task, for example "pull this object", then generates an episode by interacting with the environment back and forth, and puts that episode into a replay buffer; then it maybe selects another task, and so on, until there is a bunch of data in the replay buffer from different tasks. Then they sample episodes from the different tasks, task 1, task 2, and so on, and those become a mini-batch in the learning procedure. A pretty intricate setup, but of course the hope is that you learn a shared representation that lets you perform all of these tasks faster than if you were to learn each of them independently. The MT10 and MT50 benchmarks come from this, and I think they also have goal-conditioned pushing, where the task is simply to push something to a specified goal. The cool thing about the goal-conditioned setting is that it's not only 50 tasks: you can produce an infinity of tasks, because you can always specify a new location to push something to, which is fairly cool.

And the curves: you see that something like soft actor-critic (SAC), or multi-head SAC, where multi-head SAC is probably the closest to what I described at the beginning with the shared representation and the individual heads, severely underperforms against SAC plus PCGrad, their method, which seems to outperform fairly consistently, even against learning the tasks independently. So it learns much faster than if you were to learn these tasks independently from each other, which is pretty cool.

They also do some interesting investigations. First of all, they ask: during these learning runs, what is the curvature of the loss function? They measure it like this, which is basically a consequence of a Taylor approximation. If you have a function f, you can write f(x) ≈ f(x0) + ∇f(x0)·(x − x0); that's a first-order approximation to the function on the right. If you subtract the two sides from each other, you get the difference between the actual function value and its first-order approximation, which is most likely the curvature; strictly it is every higher-order term, but the assumption is that the dominant higher-order term is the curvature. They do this not at x and x0 but at θ_t and θ_{t+1}: the first-order approximation versus the actual function value after a step, and the resulting difference is dominated by the curvature. They analyze this over the course of learning and see that it actually increases as training goes on. I'm not a big fan of just pointing at large numbers, but the numbers do seem large compared to what you can handle with a computer, and they seem to grow by orders of magnitude across training iterations, so I'm going to believe them that this curvature is present. Still, I would have liked to see it compared to a single task instead of only comparing multi-task runs against each other, which is pretty useless because they reach different losses; what I would have liked to see is a comparison of multi-task versus single-task, showing that in single-task learning this curvature blow-up doesn't happen.

Then you have the percentage of update steps in which conditions (a) and (b) hold; remember, condition (a) was the condition on the conflicting angle, and condition (b) was the condition that the curvature is large enough. You can see that, as learning goes on (these dotted and dashed lines), the conditions hold almost all the time at the beginning of learning, and still hold for a big fraction of steps later; at the end of training it's about half the steps. So that is fairly good evidence that the problems they describe are really there, and that therefore their algorithm helps.

Then there's the average return. Interestingly, they say in the text: look, task 1 seems to be easier, and task 2, the dotted line, seems to be harder. SAC, the baseline algorithm, never really manages to learn task 2, whereas PCGrad manages it after a while, and at that point something happens over here in the other curve. I'm not super sure; that's what they say in the text, but I have to squint a lot to see that exactly at that position something happens. Suffice it to say that PCGrad is able to learn the task that SAC isn't able to learn, probably because task 1 completely dominates the gradient at that point.

Alright, so that was the paper. I invite you to read it, and thanks for listening. Bye!
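As a footnote on that curvature measurement: the proxy described above, the gap between the actual loss after a step and its first-order Taylor prediction, can be sanity-checked on a toy quadratic, where the gap is exactly the second-order term ½·dᵀHd. The loss, the Hessian `H`, and the step size below are made-up illustration values, not the paper's setup:

```python
import numpy as np

def taylor_remainder(f, grad_f, theta_t, theta_next):
    """Curvature proxy: f(theta_next) minus its first-order Taylor
    prediction around theta_t. For a quadratic loss this equals
    0.5 * d^T H d exactly, with d = theta_next - theta_t."""
    d = theta_next - theta_t
    return f(theta_next) - (f(theta_t) + grad_f(theta_t) @ d)

# Toy quadratic with very different curvature per direction,
# like the steep/flat valley in the running example.
H = np.diag([10.0, 0.1])
f = lambda th: 0.5 * th @ H @ th
grad_f = lambda th: H @ th

theta_t = np.array([1.0, 1.0])
theta_next = theta_t - 0.1 * grad_f(theta_t)  # one gradient step

r = taylor_remainder(f, grad_f, theta_t, theta_next)
d = theta_next - theta_t
assert np.isclose(r, 0.5 * d @ H @ d)  # remainder equals the curvature term
```

On a real network the remainder also contains third- and higher-order terms, which is why it is only "dominated by" the curvature rather than equal to it.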
