AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control (Paper Explained)
Duration: 34:44


Yannic Kilcher · 19.06.2021 · 11,162 views · 317 likes


Video description
#reinforcementlearning #gan #imitationlearning

Learning from demonstrations is a fascinating topic, but what if the demonstrations are not exactly the behaviors we want to learn? Can we adhere to a dataset of demonstrations and still achieve a specified goal? This paper uses GANs to combine goal-achieving reinforcement learning with imitation learning and learns to perform well at a given task while doing so in the style of a given dataset. The resulting behaviors include many realistic-looking transitions between the demonstrated movements.

OUTLINE:
0:00 - Intro & Overview
1:25 - Problem Statement
6:10 - Reward Signals
8:15 - Motion Prior from GAN
14:10 - Algorithm Overview
20:15 - Reward Engineering & Experimental Results
30:40 - Conclusion & Comments

Paper: https://arxiv.org/abs/2104.02180
Main Video: https://www.youtube.com/watch?v=wySUxZN_KbM
Supplementary Video: https://www.youtube.com/watch?v=O6fBSMxThR4

Abstract: Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, we propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning.
High-level task objectives that the character should perform can be specified by relatively simple reward functions, while the low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips, without any explicit clip selection or sequencing. These motion clips are used to train an adversarial motion prior, which specifies style-rewards for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects which motion to perform, dynamically interpolating and generalizing from the dataset. Our system produces high-quality motions that are comparable to those achieved by state-of-the-art tracking-based techniques, while also being able to easily accommodate large datasets of unstructured motion clips. Composition of disparate skills emerges automatically from the motion prior, without requiring a high-level motion planner or other task-specific annotations of the motion clips. We demonstrate the effectiveness of our framework on a diverse cast of complex simulated characters and a challenging suite of motor control tasks. 
Authors: Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, Angjoo Kanazawa

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (7 segments)

Intro & Overview

Hey, yo, where's my money? Well, give me my money! All right, we're going to get into this video in a second. Today we're looking at AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control by Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. This paper is in the domain of control and reinforcement learning, but with a little bit of a twist. On a high level, this paper trains a physical agent, as you can see here, to perform some sort of goal (in the case on the right, it's walking up to a target and punching the target), but to do so in a certain style, and the style is provided by an expert dataset, a demonstration dataset. So the technique the paper presents mixes two things: goal-achieving reinforcement learning and adherence to a given style. The style part is the adversarial part here, because that's learned in an adversarial way. The mixture of the two looks pretty cool at the end. So what is the setup here?

Problem Statement

Goal achieving and imitation learning, as we have already outlined. The way it works is the following: there is a task, and the task can be that you have to reach a goal, punch something, or overcome some obstacles and then reach a goal. Anything like this is a task. The goals are fairly high level, and they are given, obviously, by a reward function: you place the agent in an environment, and there is a reward function. The agent, as we already said, is a physical agent with some sort of 3D structure, so there are joints it can move (there's a joint here, one here, and a head). The agent is this physical thing, and it lives in a physics simulation. Each of these joints can move somewhat independently, sometimes freely like a ball joint, sometimes restricted; it's modeled very much like a human. There are other models, such as a T-rex, which of course work differently. The agent is supposed to reach a goal, like a little flag somewhere over here, and the way it can interact with the world is by putting force on any of these joints, so it can move these joints in specified ways, and that constitutes the actions. The agent observes the state, which mostly consists of the current configuration of all the joints and the velocities of the joints, or of the individual parts of itself in relation to itself, so it can sort of feel itself. It also knows in which direction, and generally how far away, the target is. That's the observation space; the action space is that it can affect these joints; and the reward function is modeled in accordance with the goal. The reward for walking to some goal might simply be that you get reward for being closer to the goal, which encourages the agent to go over there. So we work with quite dense rewards here, because the fundamental problems of reinforcement learning aren't exactly the point. The point is: can you teach these things to achieve a goal while maintaining a certain style?

That's the task and the environment. In addition, you get a dataset, and the dataset consists of demonstrations of a certain nature. These are not necessarily demonstrations of how to reach the goal; they can be any sort of demonstrations. Usually when people do imitation learning or learning from demonstrations, there are some requirements. If you want to do pure learning from demonstration, the demonstrations need to show how to achieve the goal, and we don't have that here. In other cases, you need the policy, or the actions, of whoever produced the dataset; we don't need that here either. Our goal is simply to solve the task while adhering to the dataset in a way that we're going to define in a second. You can imagine the dataset (I think there is a good demonstration down here) to give you the style of movement: one dataset can have running movements and walking movements, and in another dataset the actors could walk like zombies. The goal is to combine the style of the dataset with reaching the goal, so the combination would look like a zombie walking to the goal: it adheres to the zombie walk in the dataset and to the goal specified by the task. Naturally, you're going to model this as two different reward signals.
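As a toy illustration of the dense task reward just described, here is a minimal sketch; the exponential form and the `scale` coefficient are my own assumptions for illustration, not taken from the paper:

```python
import numpy as np

def task_reward(agent_pos, goal_pos, scale=1.0):
    # Dense task reward: larger when the agent is closer to the goal.
    # `scale` is a hypothetical shaping coefficient, not from the paper.
    dist = np.linalg.norm(np.asarray(goal_pos, dtype=float)
                          - np.asarray(agent_pos, dtype=float))
    return float(np.exp(-scale * dist))  # in (0, 1], exactly 1 at the goal
```

Any monotone function of the distance would do; the point is only that the agent receives a learning signal at every step, not just upon success.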

Reward Signals

There is the reward signal of how much you reach the goal, and there is the reward signal of how well you adhere to the style of the dataset. The goal reward is modeled by classic reinforcement learning; it says here "update G and D", yada yada. So this is policy gradient reinforcement learning, which means you have a policy function that takes in a state (and maybe a history) and gives you an action, and with it you also train a value function that takes a state and gives you a value for that state. The value function is purely for training the agent, because you do advantage estimation with it, but essentially this is a standard policy gradient method, and you train the whole thing on this reward. The bottom part, you can imagine, is the reward that comes from reaching the goal; the top part also gives you a reward. And I want to reiterate: both of these rewards are used to train the policy and the value function in a policy gradient fashion, so both rewards ultimately land in this standard advantage-estimation reinforcement learning setting. However, the top reward is calculated differently than simply asking whether you reach the goal: it is a measure of how close you are in style to the dataset, and that's given by the motion prior, and the motion prior is given by a GAN, a generative adversarial network.
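In code, combining the two signals is just a weighted sum that is then fed into whatever advantage estimator the policy-gradient method uses; the default 50/50 weights below are placeholders, since the paper tunes these weights per task:

```python
def combined_reward(r_task, r_style, w_task=0.5, w_style=0.5):
    # Both signals enter the same policy-gradient / advantage-estimation
    # update; the weights are task-specific hyperparameters (placeholders here).
    return w_task * r_task + w_style * r_style
```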

Motion Prior from GAN

I'm trying to find the formula here; I think this is the best description of it, though it's just a formula. A generative adversarial model, I'm pretty sure you're all aware: there is a dataset, and there is a generator. The generator gets some random noise as input and outputs a sample x; from the dataset you get a sample x' (or a mini-batch), and either of these goes into the discriminator model, which has to decide for any sample: is it real or is it fake?

The way this generative adversarial network approaches the problem of specifying which motions are real and which ones are not is by looking at transitions. The dataset here is not images, like you're used to in a regular GAN; the dataset consists of transitions. What does that mean? In every situation, your humanoid (or whatnot) is here and the goal is over here, and this is one state, s. Then the agent takes an action; the action could be "please lift one leg", and the new pose would be kind of here, shifting the weight a little and lifting one leg. That action leads to a new state, s'. So you have three quantities: the state, the action the agent took, and the new state s'. You could parameterize the transition either using state and action, or using state and next state. The paper does the latter, for the reason that in the dataset you do not have the actions available: you could probably guess them, but you do have the state and the next state. This dataset can come from anywhere; it can come from human demonstration, keyframes made by a 3D artist, or maybe another agent that has already solved the problem, and therefore you don't always have the actions available. So a transition is specified by a state and a next state, and the transitions from the dataset are transitions that you observe in the real world: state/next-state pairs.

The generator essentially outputs state/next-state pairs, too. This generator isn't a generator in the classic adversarial-network sense; rather, these pairs are generated by your policy interacting with the environment. The policy interacts with the environment, the environment gives you the state, and in the next step it gives you the next state, so by interacting with the environment you get state/next-state pairs. These are essentially your generated pairs, and the discriminator is trained to discriminate whether a transition comes from the real dataset or has been generated by your agent. Now, this whole system isn't backpropagatable, and that's why you train it using reinforcement learning: the usual backpropagation signal that you would have for a generator is not available, so you simply take the output of the discriminator as a reward for the policy. The policy, using a policy gradient, tries to fool the discriminator into thinking that the transitions it generates come from the real dataset, while the discriminator is simultaneously trained to differentiate between the true dataset and the transitions the policy generates. That gives you one reward signal for the policy; the other reward signal comes simply from the environment, as we've already stated. These two rewards are combined with each other and used to train the policy. The discriminator itself, this motion prior, is trained on one hand from the dataset and on the other hand from the policy generating transitions through the environment.

I hope that is a bit clear. There are many components to this, but two are important: the policy, which tries at the same time to reach a goal and to fool the discriminator (those two rewards are combined), and the discriminator, which gets transitions from the dataset and from policy-environment interaction and trains itself to pull the two apart. It's a classic two-player game, and that is what you're used to from a GAN.
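A minimal sketch of the data plumbing described above: both the motion dataset and the policy rollouts are reduced to (s, s') pairs, which is exactly why the actions never need to be recorded. The function name and array shapes below are my own, not from the paper:

```python
import numpy as np

def make_transition_batch(states):
    # Pair each state with its successor: (s_t, s_{t+1}).
    # Works identically for mocap clips and policy rollouts,
    # since no actions are needed.
    s = np.asarray(states, dtype=float)
    return np.stack([s[:-1], s[1:]], axis=1)  # shape: (T - 1, 2, state_dim)

# Real transitions (discriminator target +1) come from the clip dataset;
# fake ones (target -1) come from the policy's environment interaction.
```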

Algorithm Overview

All right, and that's essentially it for the setup. Here is the algorithm. We generally initialize everything; there is a replay buffer, as in classic reinforcement learning, which stabilizes training quite a bit; and I also mentioned the value function, which is used for the advantage estimates of the policy gradient. For M steps you collect trajectories using the policy you already have, then you feed the transitions to the discriminator. Note that this here is a feature function of the state: they have special feature functions that make the problem easier, and there is a lot of expert knowledge going into how you build the features and how you represent the environment, so it's not quite trivial, but I don't want to go too much into that.

You calculate the style reward according to equation 7, which is based on the discriminator output. It's not the discriminator loss; the discriminator loss is actually this thing right here. They use a least-squares loss for the discriminator instead of the classic GAN loss, which would be the thing up here with its log D and log(1 − D) terms, and they found the least-squares loss to work a lot better. You can see the discriminator is trained to output a value close to 1 if the data comes from the real dataset (capital M here), and −1 when it comes from the policy. Nothing stops the discriminator from outputting any number, like 15 or 3; it's just trained in a least-squares fashion towards these targets, which gives you a better gradient. For these continuous control problems you often have to go to least-squares objectives, because which number is being output is quite important, rather than just a classification; and even here, where it actually is a classification problem, the least-squares loss works better, which is surprising but cool. The reward for a given transition is then calculated from the discriminator output and clipped at zero, so it is also between zero and one. If the discriminator outputs 1, the reward is at its highest (the reward is actually 1), and the discriminator outputs 1 if it thinks the transition comes from the real dataset. So if the policy manages to produce a transition that the discriminator thinks comes from the real dataset, it gets maximum reward; and if it also reaches the goal, it gets maximum reward from that part of the reward signal too. The general encouragement we give the policy is: you should reach the goal in a manner that's consistent with the dataset, so it should pick out behaviors that do both. It could try to switch between the two modes (a bit of dataset style here, a bit of goal reaching there), but it's probably better if it picks behaviors from the dataset that also reach the goal, in a manner consistent with the task reward.

To finish the algorithm: this is the style reward; the true reward is given by a weighted mixture between the style reward and the task reward, and you have to specify the weights. Then we store the trajectory in our replay buffer, use the replay buffer to update the discriminator, and also update the value function and the policy according to the policy gradient. They point out a few things that are important to their algorithm. One they find very important is the gradient penalty: GAN training can be a bit unstable, and gradient penalties are a way to stabilize it. They found that simply penalizing the norm of the gradient as it comes out of the discriminator stabilizes the training, and they claim this helps them a lot to actually converge. This tells you a little bit that it's still quite finicky. They also talk a lot about the representation of the actions. As for network architecture, the policy, value, and discriminator functions are very simple multi-layer perceptrons: the mean of the policy, for example, is specified by a fully connected network with two hidden layers consisting of 1024 and 512 ReLU units, that is, fully connected layers with ReLU non-linearities, followed by a linear output. The networks aren't super complicated; what's more complicated is the training procedure, the loss, the regularization constants, and the reward engineering.
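The least-squares discriminator objective and the clipped style reward of equation 7 can be sketched in a few lines. This operates on raw scalar scores in plain NumPy; the actual implementation trains a network on transition features and adds the gradient penalty, which is omitted here:

```python
import numpy as np

def lsq_discriminator_loss(d_real, d_fake):
    # Least-squares objective: scores on real dataset transitions are pushed
    # toward +1, scores on policy-generated transitions toward -1.
    return float(np.mean((d_real - 1.0) ** 2) + np.mean((d_fake + 1.0) ** 2))

def style_reward(d_out):
    # Equation 7: r = max(0, 1 - 0.25 * (d - 1)^2).
    # d = 1 ("looks real") gives reward 1; the clip at 0 bounds the penalty
    # for transitions the discriminator confidently rejects.
    return max(0.0, 1.0 - 0.25 * (d_out - 1.0) ** 2)
```

Note how the reward saturates: a discriminator score below −1 cannot make things worse than 0, which keeps the style signal bounded for the policy-gradient update.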

Reward Engineering & Experimental Results

There is a lot of reward engineering happening here, and that's what you find in the appendix. The reward for going and punching something, for example, is threefold: if you are far away it's one reward, if you're close it's a different reward, and if the target has been hit it's yet another reward. I guess the top line makes sense, but the others are reward shaping for the behavior you want: you want the agent to approach the target fast, but then slow down. And if you look at something like dribbling, where a ball is involved, there is a lot of reward shaping going on; even in target location there is a lot of reward shaping, where you encourage the agent to have certain velocities and so on. This is important because of the experimental results they show, and that's where we go back to the video.

Keep in mind, their point is that you're able to reach a goal in the style of the dataset. The simplest task they have is called target heading, and the goal is simply to walk or run in a given direction at a certain speed. The example clips, displayed on the right, are of someone walking and of someone running; yet there is no transition in the dataset from walking to running, and the agent learns this transition by itself. Their point is always: look, we have the individual parts in the dataset that the agent should do, but we never have the combination of all the things, and stitching these parts together is the powerful thing about this method, which is pretty cool. At the top right there is a target speed, and all three agents are trained in the same manner and told to reach that target speed. However, the agent on the left has only been provided with a dataset of people walking; the agent in the middle has only received a dataset of agents running, no walking; and the agent on the right has received a dataset of agents both walking and running. You can see that as the target speed changes, the walker is not able to keep up when it's fast, and the runner is not able to slow down when it's slow. The agent that has the full dataset available, however, can not only match the speed and change its style according to the speed; it also learns the transitions from one to the other, and these transitions are not in the dataset itself. So the cool part about this method is that it can stitch together the appropriate behaviors from the dataset, even if you don't provide them specifically to solve the task.

The T-rex, I think, is just to show that you don't have to use motion capture: you can learn from a provided dataset of keyframe animation. You can also see there is nothing in the dataset about reaching a goal, just demonstrations of the T-rex walking, and the method is able to adapt this walking style in concordance with reaching a goal. The turning is much like the turning in the example clips, whereas if you've ever seen things like this without the examples, the policies these systems come up with are quite weird.

Here is a failure case, and it shows the difference between this method and others. Other methods, such as the motion tracking in the middle, try to match a given behavior from the dataset as closely as possible (it's called motion tracking; there is some sophistication to it, more than I'm saying here). Essentially you have a front flip on the left, and the motion-tracking algorithm tries to learn a policy such that the behavior is followed as closely as possible. Again, this is really good when you have the exact demonstration of what you want to do available; it's not so good if what you have available as demonstrations isn't really what you want to do, but just some demonstrations. There are failure cases, of course, if you want an exact copy, like the front flip. By the way, the reward function here is how closely you match the reference motion. However, motion tracking does more than that: it really tries to track the motion itself, while this method would only get the reward for matching the motion, and you can see it doesn't manage to actually learn the flip. It doesn't so much try the flip as try not to fail it: it reaches the same end position, and that's good enough for it. So there is a trade-off, probably also governed by how much you weigh the different reward components.

Here you have a dataset of agents walking and agents waving, and what you want is an agent that walks in a direction while waving or lifting an arm. On the left you can see that if you only have a dataset of the waving agents, it really struggles moving forward: the walking is a struggle to learn because it has no demonstration of walking. If you only have the walking demonstrations, in the middle, it doesn't really track the arm movement where it should, even though there is a reward for it. Only on the right is it somewhat able to interpolate. If you want to check out this video, there is another one that actually explains the paper in short form; this is from SIGGRAPH, go check it out. They do have more sophisticated behaviors: at the bottom you can see, for example, the obstacle run, leap, and roll, where the dataset contains demonstrations of all of those things, but not in conjunction with each other.

In this example here, at least as they describe it in the text, what they have in the dataset is demonstrations of walking and demonstrations of getting up from the ground, and the agent learns that whenever it falls over, it can get up faster if it does this rolling motion. That was nowhere in the dataset, but because the agent wants to reach a standing state, both because that makes it go towards the goal and because that matches behavior in the dataset, it learns this rolling motion as it falls down in order to get up again, which is pretty cool. Also, in this strike-and-punch example, the dataset apparently only contains agents walking or agents punching; it never contains agents walking and then punching, so the transition you saw at the beginning is a learned behavior that wasn't in the dataset. I think it's a pretty cool application and combination of two things: adversarial learning, and learning to reach a goal (not learning from demonstration, exactly); it's a good demonstration of how you can combine the two.

They have a lot of ablations where they show that the dataset makes a big difference. You've seen this in the demonstrations, but here you can see it again in graphical form: the locomotion dataset contains demonstrations of both walking and running, while the walk and run datasets only contain demonstrations of either, and the plot shows the target speed versus the average speed the agent achieves. If you only have a walking dataset, then no matter the target speed, the agent will always stick to walking; if you have the running dataset, it can run faster, up here, but if you want it to slow down, it can't really run slower than you require. Only when the dataset contains both things can it transition between the two and actually match the target speed.
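To make the staged reward shaping discussed above concrete, here is a hypothetical sketch of a punch-task reward. The thresholds and functional forms are invented for illustration and do not match the paper's appendix terms:

```python
import numpy as np

def punch_task_reward(agent_pos, target_pos, target_hit, far_radius=1.5):
    # Hypothetical staged shaping in the spirit of the appendix rewards:
    # all constants below are illustrative, not the paper's.
    if target_hit:
        return 1.0  # terminal stage: the target has been struck
    dist = float(np.linalg.norm(np.asarray(target_pos, dtype=float)
                                - np.asarray(agent_pos, dtype=float)))
    if dist > far_radius:
        return 0.3 * float(np.exp(-0.5 * dist))    # far: reward approaching
    return 0.3 + 0.4 * float(np.exp(-2.0 * dist))  # near: finer approach term
```

The staging itself (different reward terms depending on distance and hit status) is what nudges the policy into the approach-slow-punch sequence, which is why such shaping does some of the work the video credits to skill composition.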

Conclusion & Comments

So, what do we think of this? My opinion is that it's very cool, and it's a good way of bringing demonstrations into the picture without tracking or exactly copying the demonstrations. You just give the algorithm some suggestions of what it could do, in the form of a dataset, which I like, because it's not as invasive as telling the agent it needs to match the joint movements of the demonstration. This enables demonstrations of a much broader range to come in: they don't necessarily reach the goal, and don't necessarily even have a goal in mind. That's cool.

On the other hand, I think it's pretty finicky, because you have to strike the trade-off between the two rewards quite carefully for your goal. As we've already seen, at some point the agent won't reach the goal anymore if the style reward is weighted too highly: if you have a dataset of just running, the agent will simply neglect the goal and won't go slower than the slowest running demonstration (or a little bit slower than that); it just won't change its policy, because it needs to match the dataset. This balance seems to be quite an important hyperparameter, and that also makes the provided dataset quite an important thing to have available: which dataset you provide matters a lot.

Lastly, the tasks themselves, or rather the rewards of the goal-directed tasks, are in this paper extremely engineered, and that's what I want to come back to. What they tout, for example, in this walk-and-punch thing is that when the agent is far away it runs towards the target, but when it's close it slows down, and when it's really close it punches the target, and it learns to combine these different skills, which is cool, because the transition wasn't in the dataset. But a big part of combining these skills comes from the reward being different depending on whether the agent is far away or near, as you can see right here. These things are reward-shaped to a high degree to encourage these kinds of transitions, which I think is not really practical in a lot of settings. It's still to be seen how much practical value this has in other reinforcement learning tasks where you don't have that available, and also in tasks where the reward is more sparse, and how that affects this method: if the reward is much more sparse and irregular, you have a problem, because now the style signal is much more prominent, and that's not necessarily solved by simply re-weighting the style signal. I'm excited to see what comes out of this line of work next; it's a pretty cool line, and as I already said, it's a good application of GANs in a field other than images. With that, let me know what you think in the comments, and I'll see you next time. Bye.
