Generative Modelling | Sadhika Malladi | 2018 Summer Intern Open House
12:43


OpenAI · 11.09.2018 · 6,339 views


Video description
Sadhika Malladi's presentation about her work during her summer internship at OpenAI. Recorded at the OpenAI 2018 Summer Intern Open House on August 16, 2018.

Table of contents (6 segments)

<Untitled Chapter 1>

Hi everyone, I'm Sadhika. I'm an undergrad at MIT, and I'm doing my internship here at OpenAI. Broadly, my internship is about

Generative Modeling

generative modeling. So what are generative models? They are models that generate data within some distribution. For example, if you train your model on images of dogs, you want it to generate images of previously unseen dogs. Why is this useful? It's not inherently clear from that description, but when a model is able to generate images of dogs, you hope it has learned something more useful about exactly what is in those images, some kind of structure that's important. There are many applications for these types of models. The first one

World Modeling

I'll be talking about is world modeling. World modeling is useful for sample-efficient, generalizable reinforcement learning. So what exactly is world

What Exactly Is World Modeling

modeling? Well, in our hopes and dreams, we want our agents to first learn basic movements. For example, when you're learning how to walk, you learn how to put one foot in front of the other, then you learn how to turn without falling, and maybe even walk backwards. Once you've learned these basic movements, you no longer need specific instructions on how to get from the couch to the TV, or from your house to school, because you now have a sense for how basic movements cause interactions with your world. The idea is that if we can train our agents this way, they will hopefully be able to do more complex tasks they haven't explicitly been trained on, and we can train them much more easily using transfer. We also want to use less data, because collecting episodes from games is expensive.

In this example I have a basic game called Car Racing. As you can imagine, when you're learning how to drive a car, you learn how to go forward faster, how to slow down, and how to turn. You can see in the video that the car is able to drive on this track even though it has never actually seen it before: it's learning how to turn, and when it messes up it's able to recover, because it knows how its movements will interact with its environment. This is why model-based reinforcement learning agents would be really cool; these are our hopes and dreams for them.

But alas, there are some shortcomings. First of all, many of the current implementations of model-based reinforcement learning are game-specific. They're also difficult to train and kind of finicky. One of the biggest problems is that if a generated frame contains a small artifact, then as you keep rolling out and trying different actions within that world model, the artifacts just build up; the model is never able to fix or correct them. These small errors can propagate really quickly and cause a lot of problems during rollouts, and for some combination of these reasons we're not able to get model-based agents to do as well as model-free agents. In the example on the right, I'm showing the exact same model that produced the Car Racing video, but now on Pong. In the ground truth you have two paddles and a ball, and you want to keep the ball in play; but when you encode a frame and reconstruct it, you no longer see the ball or the other paddle. (I'll explain what I mean by reconstruction in a second.) The key point is that the reconstruction misses the important parts of the game: you don't know where the ball is, so it's hard for the model to learn.

So how exactly was that Car Racing video trained? You take a frame and run it through a variational autoencoder: a series of convolutions produces some latents, which I represent with z here, and then a series of deconvolutions produces a reconstruction of the frame. You compare that reconstruction to the original frame to get an error signal to train with. You also feed the latents, which you hope capture the time-dependent state of the game, to an RNN, and the RNN keeps track of things like where you are in the game, how far away the obstacles are, and what exactly the game state is. The key is that you don't want your latents to be wasted encoding something that's true of all game states. For example, if you're running through a grassy meadow, you don't want the latents to encode only "grassy meadow"; you want them to encode something like "there is a rock there that I could trip on."

In this video it's again the Car Racing game: on the left is the ground truth, and on the right is what the model actually reconstructs. You can see it's pretty accurate.
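The encode-reconstruct loop described above can be sketched at a shape level like this. This is a minimal illustration only: the random linear maps are made-up stand-ins for the convolutional encoder and deconvolutional decoder, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the conv encoder and deconv decoder:
# a 64x64 grayscale "frame" is compressed to a 32-dim latent z.
frame = rng.random((64 * 64,))                # flattened input frame
W_enc = rng.normal(0, 0.01, (32, 64 * 64))    # encoder weights (hypothetical)
W_dec = rng.normal(0, 0.01, (64 * 64, 32))    # decoder weights (hypothetical)

z = W_enc @ frame                 # latents: compact, time-dependent state
reconstruction = W_dec @ z        # decoded frame

# The reconstruction error is what drives training
# (a real VAE loss also includes a KL term on z).
mse = np.mean((reconstruction - frame) ** 2)

# It is the latents, not the raw pixels, that the RNN consumes
# to track the evolving game state.
rnn_input = z
print(frame.shape, z.shape, mse)
```

The point of the bottleneck is visible in the shapes: 4096 pixels are squeezed into 32 numbers, so the latents cannot afford to encode anything that is true of every frame.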
It doesn't have those red bumpers, which aren't really that important. The key thing to notice, and the reason this worked on Car Racing but not on Pong, is that there are so few elements in Car Racing: there's only the track and the green around it. If you can roughly model those, you can train your agent pretty well, but it doesn't generalize to more complex games, or to games with components at multiple scales.

So how do we seek to improve this? Well, when we pair our VAE encoder with a strong decoder, our latents are forced to encode only things that are time-dependent, which is what we want. In this case, when we use a PixelCNN and condition it on the time-dependent state of the game, we can force the latents to encode just something like "there is an obstacle at (x, y)." What does that obstacle look like? That's for the PixelCNN to generate. By doing this, we create a model that's more robust, and closer to how we learn and think: I know that I could trip over this table, but I also know I could trip over that couch, and the details of what I'm tripping over aren't that important; what matters is that I shouldn't trip. We also know that PixelCNNs can generate really high-resolution images, so we hope they can generalize to games that look a lot richer than Car Racing, or even Pong.

That's the work I've been doing with world modeling. Now I'm going to transition to talking about how I've been using generative models in the context of invertible flow models, which are known for being expressive models with efficient sampling. Before I talk about the architecture, I'll show exactly what I mean by expressive. This is a video of my friend Prafulla, who wasn't going to be here, but he's here. What you can see is Prafulla's face morphing into different celebrities' faces. Why is this impressive? Because for our eye to look at that and say "wow, that's a realistic morphing," every single one of the images generated along the way has to look like a person's face; if something looks grotesquely wrong, we notice it right away. What this tells us is that an invertible flow model learns the latent space so richly that you can interpolate between the encodings of different images and still get a reasonable interpolation. In my mind, if I were to imagine Prafulla becoming Neil deGrasse Tyson, this is how I would imagine it, but I don't do that often. So what are our

Expectations of Flow Based Generative Models

expectations of flow-based generative models? What we want to do is take some random noise, as you can see in the image on the far right: you sample noise from some prior, usually one with a tractable density, and then you warp it via your model into what you want to see, maybe faces, maybe cats and dogs. What these equations are saying is that you sample a z from your prior and run it through your invertible network to get a generated output. So f, our model, needs to be invertible, and we can use the change of variables formula to write exactly how the log-likelihood of the input can be expressed in terms of this model. When we do generative modeling, we always want to maximize the likelihood of the inputs we're given, because that's our best bet at generating something realistic. Here, the log-likelihood of an input is the log-likelihood of the corresponding noise under the prior, minus the log-determinant of the transformation from the latents to the inputs.

In reality, it takes a lot of effort to design a fully invertible model. There are very specific components that can be used, and each of those components tends not to be very expressive, so when you want to build a model that can actually turn noise into faces, you need an extremely large model, which requires a lot of memory. As you can see, I have placed this meme here to demonstrate my frustrations: you build a large model and you want it to be tractable, but no matter how much memory you have, you always run out of memory.
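The change of variables formula above can be checked concretely in one dimension. This is my own toy example, not the Glow code: with an affine flow x = e^s · z + b and a standard-normal prior, log p_X(x) = log p_Z(z) − s, which should agree with the density of N(b, e^{2s}) evaluated directly.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    # log density of N(mean, std^2)
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

s, b = 0.7, 1.5   # parameters of a 1-D affine flow: x = exp(s) * z + b
x = 2.3           # a point at which we evaluate the likelihood

# Invert the flow to find the latent that generates x.
z = (x - b) / np.exp(s)

# Change of variables: log p_X(x) = log p_Z(z) - log|dx/dz| = log p_Z(z) - s
log_px_flow = log_normal_pdf(z, 0.0, 1.0) - s

# Direct check: if z ~ N(0, 1), then x ~ N(b, exp(2s)).
log_px_direct = log_normal_pdf(x, b, np.exp(s))

print(np.isclose(log_px_flow, log_px_direct))  # the two computations agree
```

Maximizing this likelihood over the flow's parameters is exactly the training objective described above; the log-determinant term is what keeps the model from cheating by collapsing all inputs onto a high-density latent.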

Transformations

The types of transformations you can do are things like affine transformations or one-by-one (1×1) convolutions, and I can talk more about that when I describe the specific architecture. The first flow model of this kind was called RealNVP, and on top of it, people at OpenAI refined it into what's called Glow. There aren't that many differences between the two. The basic idea is that you take an input and use a multi-scale architecture: because 1×1 convolutions only pick up features at a particular scale, you need to reshape your inputs so you can pick up features at different scales. So you squeeze, which just changes the dimensions, and then you go through a step of flow, shown on the left. A step of flow is a simple normalization, a 1×1 convolution, and then a coupling layer, where you split the input: you run half of it through a deep network to transform the other half, and let the first half pass through unchanged. The coupling layer is what takes up the most memory in our flow models. So when I set out to work with these models, I shared the parameters of the coupling layers across all of the different steps of flow. You would think that's a lot of parameters you're losing, that maybe only one out of every 32 flow steps is going to be actually expressive, but what I found was that you can generate very realistic-looking samples and match the bits per dimension using 3% of the memory you needed before, and training goes 33% faster in wall-clock time. These are just CIFAR-10 samples, and as you can see, on both sides they're pretty realistic. That's pretty much it for my presentation. Will is going to come up here to talk a bit about the work we've been doing together on improving flow models in other ways.

Yes, so in our new architecture we only encode: we only go from the inputs to the latents, and then we use the latents as a conditioning vector for the PixelCNN. The raw input also goes to the PixelCNN, and what comes out of the PixelCNN is what you actually train against the true inputs. Did that make sense? Any other questions?

There are actually some troubles that come with invertible models. The biggest pro of using an invertible model is that you can sample really efficiently: drawing random noise is easy, and running it backwards through the model is also easy. However, because you are constrained to these invertible components, you're forced to accept a lower level of expressivity. They can also be finicky to train, because you can go into regions where the model is no longer invertible, or where you're forcing it to be invertible and it's not able to express what it needs to. This is part of what my work with Will has been addressing.
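The coupling layer and its inverse can be sketched like this. This is a minimal NumPy illustration of a single affine coupling step of the kind used in RealNVP and Glow; `net` here is a made-up stand-in for the deep network, and in the shared-parameter variant described above, its weights `W` would be reused across all coupling layers rather than being unique to each one.

```python
import numpy as np

def net(h, W):
    # Stand-in for the deep network inside the coupling layer; in the
    # shared-parameter variant, the same W is reused by every flow step.
    out = np.tanh(h @ W)
    scale, shift = out[: h.shape[-1]], out[h.shape[-1]:]
    return scale, shift

def coupling_forward(x, W):
    # Split the input in half: one half passes through unchanged and
    # parameterizes an affine transform of the other half.
    x1, x2 = np.split(x, 2)
    s, t = net(x1, W)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2])

def coupling_inverse(y, W):
    y1, y2 = np.split(y, 2)
    s, t = net(y1, W)          # y1 == x1, so we recover the same s and t
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

rng = np.random.default_rng(0)
d = 4
W = rng.normal(0, 0.5, (d // 2, d))  # half the dims map to (scale, shift)
x = rng.normal(size=d)

y = coupling_forward(x, W)
x_back = coupling_inverse(y, W)
print(np.allclose(x, x_back))  # invertible by construction
```

Note that invertibility holds no matter what `net` computes, since the inverse only needs the unchanged half to recover the same scale and shift; that is why the deep network inside a coupling layer can be arbitrarily expressive, and why sharing its parameters across steps does not break the flow.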
