Learning Dexterity | Alex Ray | 2018 Summer Intern Open House

Video description
Alex Ray gives an overview of the work being done on the OpenAI Robotics team. Recorded at the OpenAI 2018 Summer Intern Open House on August 16, 2018.

Table of contents (5 segments)

Introduction

Hi folks, I'm Alex Ray, and I'm on the robotics team here. This is going to be a different sort of talk: an overview of a recent large team result. Unlike the really great intern projects you've seen, which are limited in people and time, this is about twelve months of work by about twelve people, so there's a lot to get through and I'm only going to scratch the surface. If you want to learn more, we have a really well-produced three-minute video on YouTube and our website, a blog post in a nice human-readable format, and a longer research paper, so there are lots of levels of detail available. For now I'll give you an overview of what we did and a peek inside it.

Here's how I'm going to break down my tiny amount of time: the task, which is the problem we tried to solve; the research process, which is what solving it looked like; the systems we built, which were actually very simple; and the results we got from them.

Our task: we have a five-fingered dexterous robot hand. It's underactuated, with 20 degrees of actuation and 24 degrees of freedom; for reinforcement learning folks who deal with continuous control, that is an awful lot. We have objects in the hand that we would like it to manipulate, and for us "manipulate" means achieving arbitrary rotations. If you imagine holding a small object in your hand: can you point it in an arbitrary direction without dropping it? Our primary goal, our North Star, was to manipulate the object in the robot hand through a sequence of 50 independently, randomly drawn rotations. We also had secondary goals that weren't strictly necessary. We would like to solve it from vision, meaning you don't need a specially instrumented object: you can just drop the object into the hand, and as long as your vision model can see it, you can manipulate it. Relatedly, we want to manipulate diverse objects, not just cubes with the letters of OpenAI on them. And we want to train using a physics simulator, without any real data. Again, the North Star is the primary task; the rest are reach goals.
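To make the primary goal concrete, here's a minimal sketch (my own illustration, not code from the project) of drawing one trial's 50 target rotations as uniformly random unit quaternions:

```python
import numpy as np

def random_quaternion(rng):
    """Sample a uniformly random 3D rotation as a unit quaternion
    (Shoemake's method)."""
    u1, u2, u3 = rng.uniform(size=3)
    return np.array([
        np.sqrt(1 - u1) * np.sin(2 * np.pi * u2),
        np.sqrt(1 - u1) * np.cos(2 * np.pi * u2),
        np.sqrt(u1) * np.sin(2 * np.pi * u3),
        np.sqrt(u1) * np.cos(2 * np.pi * u3),
    ])

def goal_sequence(n_goals=50, seed=0):
    """One trial: a sequence of independently drawn target rotations that
    the policy must reach one after another without dropping the object."""
    rng = np.random.default_rng(seed)
    return [random_quaternion(rng) for _ in range(n_goals)]
```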
Here, as Fish demonstrated earlier, is an example of our physical setup: a giant cage with the robot hand in it, and on the other side the same scene rendered with our simulation renderer. The robot is a giant bag of unmodeled effects: it has backlash, it has transmission problems, it has creep and stretch in the tendons. The simulator has none of that. This slide just shows where the cameras sit in the cage, for our secondary goal of solving from vision. For our secondary goal of diverse objects, here it is manipulating an octagonal prism, which is kind of cool. And for training purely on simulated data, this is everything we have during training: a renderer that simulates our vision data and a physics simulator that simulates our physics data.

The process involves a lot of dealing with robot hands; it's not all training models, but it starts with training models. For twelve months we repeated one iterative cycle over and over, and it got faster and faster as we got better at it: it used to take up to a month, and now we can do it in a few days. We train a totally new model with reinforcement learning for the policy that controls the robot, and a totally new vision model that localizes the object inside the hand. We try running it on the real robot, observe that it fails, observe how it fails, try to improve it, and repeat.

All of these ingredients (reinforcement learning, robotics, deep neural networks, physics simulation) are sources of complexity, so throughout we tried to build the simplest possible solution. One thing we did was start very ambitiously and then break the problem down into much simpler tasks. We initially tried to achieve six degrees of freedom on the object: not only which direction it's pointing, but where it is, like lifted up off the palm. We simplified that to rotation only: first major-axis-aligned rotation (just get the x-axis pointing up), then just spinning the object around z, and when we had trouble with that, just reaching the fingertips to arbitrary positions in space. Eventually we were able to climb back up this ramp, but breaking the task down was a big part of unlocking research progress quickly earlier in the project.

Another thing we learned along the way is that you have to try lots of things at once. In the blog post and the paper we describe two of our vision tracking systems, but we don't really describe all the ones we tried that didn't work. Here are some, and I'm probably missing a few: retro-reflective infrared tracking dots like OptiTrack, which are common in the motion picture industry; depth cameras like the RealSense or Kinect; magnetic field tracking like Polhemus, the sort used in early virtual reality controller setups; active illumination targets from PhaseSpace, the red dots you see on the fingertips, which actually did work; fiducial and barcode tracking like ArUco; and finally plain RGB cameras. The paper presents the simplest thing that worked, but quite a lot of things didn't work ahead of that.

The final research ingredient was lots and lots of domain randomization. Accurately simulating the robot, with its whole bag of unmodeled defects, would have been effectively impossible and would definitely have run much slower than real time, and we wanted a fast simulator. So we added lots of domain randomizations: instead of accurately modeling the world, you only have to model the noise of the world, and the noise of the world is more efficient to sample from. For example, we don't know the exact friction of the hand; we didn't even really know what the units of friction were or what sane values looked like. On top of that, there are things on the hand that behave approximately like friction, such as ridges from machining or little screw holes that objects can get caught in, and the hand is made of different materials. So instead of trying to model all of that accurately, we took all the parts of the hand, which you can see as different colors, and assigned them all very different frictions during training. Somewhere along the project we contracted a professional roboticist to come in and fix the hand whenever it broke, because we were breaking it so often, and when we told them what friction values we were using, they were surprised it worked at all.
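To illustrate the friction story, here's a minimal sketch of per-part friction randomization in a MuJoCo-style simulator; `sim.model.geom_friction` follows MuJoCo's layout, but the scale range and the `nominal_friction` attribute are placeholders I made up:

```python
import numpy as np

def randomize_frictions(sim, rng, scale_range=(0.7, 1.3)):
    """Resample a different friction multiplier for every geom (part of
    the hand) at the start of each training episode, instead of trying
    to measure the true friction of each material."""
    noise = rng.uniform(*scale_range, size=sim.model.geom_friction.shape)
    sim.model.geom_friction[:] = sim.nominal_friction * noise  # placeholder attr
```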
But again, these simple systems are able to learn very robust policies. We also tried adding a glove to the robot, trying to solve the problem from the other side: instead of solving it with the AI, solve it with the physical world. It turns out the gloves didn't end up working, which was sort of surprising to us, but the domain randomization did.

Overview of the System

Here's a quick overview of the system we built, a bigger version of the graphic you saw from Fish earlier. On the top left is our policy, which is a very simple recurrent model, and in the middle is our vision model. We trained the two separately; they're not trained together. We tried joint training once and it turned out not to help, and training them separately is computationally cheaper, so that's what we did. When they're rolled out in the real world, we just slap them together: we take images from the cameras and pass them to the vision model, which generates observations of the object's pose; we add those to the robot's own observations, give everything to the policy, act, and repeat.
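A hypothetical sketch of that deployment glue; every interface here (`cam.capture`, `robot.fingertip_positions`, `policy.act`) is invented for illustration:

```python
import numpy as np

def rollout_step(cameras, vision_model, policy, robot, lstm_state, goal):
    """One control step at deployment time: cameras feed the vision model,
    its pose estimate joins the robot's own sensor readings and the current
    goal, and the recurrent policy picks the next action."""
    frames = [cam.capture() for cam in cameras]        # RGB views of the hand
    object_pose = vision_model(frames)                 # position + rotation
    obs = np.concatenate([robot.fingertip_positions(), object_pose, goal])
    action, lstm_state = policy.act(obs, lstm_state)   # LSTM carries memory
    robot.apply(action)
    return lstm_state                                  # repeat next step
```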

The Vision Model

For the neural network nerds out there, here's a rough diagram of the vision model. It's very simple; we were honestly surprised this worked. As Fish described, there are three convolutional towers that all share parameters; you slap their outputs together and regress the position and rotation from them. Fish's work went on to improve this result considerably, but it turns out we were able to just use this and get it to work on the real robot.
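For the curious, here's a rough PyTorch sketch of a model with that shape; the layer sizes and heads are placeholders, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class PoseFromVision(nn.Module):
    """Three camera views pass through one shared convolutional tower,
    the features are concatenated, and heads regress object position
    and rotation (as a quaternion)."""
    def __init__(self, feat=128):
        super().__init__()
        self.tower = nn.Sequential(               # shared across all views
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat), nn.ReLU(),
        )
        self.position = nn.Linear(3 * feat, 3)    # x, y, z
        self.rotation = nn.Linear(3 * feat, 4)    # quaternion

    def forward(self, views):                     # views: list of 3 image batches
        h = torch.cat([self.tower(v) for v in views], dim=-1)
        quat = self.rotation(h)
        return self.position(h), quat / quat.norm(dim=-1, keepdim=True)
```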

The Policy Architecture

The policy architecture is also very simple: a normal actor-critic setup, where noisy observations and a goal are given to the actor. The actor, on the left, is the interesting one because it's the network that actually runs on the real robot: a fully connected layer, an LSTM, and then the output actions. That's all it took to achieve something the field of robotics hadn't been able to achieve before. The value network gets more observations, ones that aren't noisy, so it has more going on, but it's mostly used to calculate advantages during training. It never runs on the real robot, so we don't mind that it sees privileged state and doesn't have to model the domain randomization noise as much.
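A minimal PyTorch sketch of that asymmetric split, with illustrative sizes; the real networks differ in detail:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """The deployed network: one fully connected layer feeding an LSTM,
    mapping noisy observations plus the goal to motor actions."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.fc = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, act_dim)

    def forward(self, obs, state=None):
        h, state = self.lstm(torch.relu(self.fc(obs)), state)
        return self.out(h), state

class Critic(nn.Module):
    """The value network sees richer, noise-free simulator state; it is
    only used to compute advantages during training, never on the robot."""
    def __init__(self, full_state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(full_state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, full_state):
        return self.net(full_state)
```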

Training Architecture

The training architecture is an interesting part of the paper. Most of it is about how you train a simulated robotics policy on more than 6,000 CPU cores and eight GPUs; most of the machinery exists just to support that training. This turned out to be the simplest thing that worked, and it wasn't even built only for us: we inherited it from a different team at OpenAI. It's a system we call Rapid, and it's the same system that powers the Dota bots, so the two projects have a lot in common. In the paper we compare both Soft Actor-Critic and PPO, but the PPO is basically the same PPO that plays very well against really good Dota 2 players. Beyond that, our system has its robotics-specific pieces and the Dota system has its Dota-specific pieces.

All right, results: what did we find out? The most important result is that yes, we are able to do this task. We were able to do simple object manipulation: we can learn a policy and a vision model purely from simulated data, and they transfer to the real robot hand. Among the sub-results: as mentioned earlier, we're able to act from sparse observations. The actor (again, the policy network on the left is what runs on the real robot) has an interesting footnote in the paper: one observation we meant to include was left out by a software bug, and the policy still worked, which was kind of surprising.

We talk about some of the surprising findings in the blog post; a lot of things that are counterintuitive to traditional roboticists turned out to work, and I'll go through them quickly. We're able to track objects from cameras with very low positional and rotational error, lower than we got from real images. Part of that is that gathering real data is really hard and unstable: if you bump something in the setup, or change how a curtain is hanging, your real data gets stale fast, while the simulator keeps on going. We're able to manipulate different objects: the octagonal prism worked, while some large round objects we tried give it trouble, and we're still trying to understand why. We're able to train in a simple simulator; even with all our randomizations, it still trains. We're able to show that the randomizations we added improved performance in the real world. I'll skip this slide because I'm out of time. Having an LSTM is better than having a convnet, which is better than being fully connected. And running with more GPUs gives you better performance.

[In response to an audience question about what was randomized:] Oh man, there are lots; they're all enumerated in the paper. The short answer is: as many things as we could. For vision we actually used Unity, a video game renderer, instead of the simulator's default renderer, because in pursuit of photorealism game renderers give you many more dials to adjust, like metallicness, glossiness, or the color of the reflectance, and we randomized everything we could there. The physics side was a more manual process; it turns out physics simulators mostly want to be accurate, not inaccurate, and a bunch of the effects were expensive to model. For example, we tried to simulate backlash, and as we explain in the paper we don't get it exactly right, because modeling backlash is a really hard problem; but randomly moving motors in the opposite direction is close enough sometimes. Similarly with the PhaseSpace dots on the fingertips: if you curl the fingers all the way in, the dots can't be seen by the vision system and they drop out. It's hard to model exactly which states make the dots disappear, but we can just have them disappear 25% of the time, and that roughly approximates it.
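A toy sketch of those last two randomizations; every constant here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def backlash_noise(actions, p=0.05, scale=0.1):
    """Crude stand-in for backlash: occasionally nudge a motor command
    the opposite way instead of modeling gear slack exactly."""
    flip = rng.uniform(size=actions.shape) < p
    return np.where(flip, -scale * actions, actions)

def marker_dropout(fingertip_markers, p_drop=0.25):
    """Marker-occlusion randomization: rather than predicting exactly when
    curled fingers hide the LED dots, blank them out 25% of the time."""
    if rng.uniform() < p_drop:
        return np.zeros_like(fingertip_markers)  # markers "disappear"
    return fingertip_markers
```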
So the pattern is: anything we could figure out how to randomize easily, we randomized by default; and for effects we noticed on the robot and had hypotheses about, we manually added randomizations one by one. Does that sort of answer your question?

[In response to another question:] Some randomizations help, some hurt, and some roughly break even. In basically every case we were able to solve the task in simulation. That's something different from normal academic reinforcement learning in MuJoCo-style simulated physics worlds, where solving it in simulation is what counts: basically everything we tried we could solve in simulation, but the only thing that counted for us was whether it worked on the real robot. And we were very limited in how many runs we could do: for every trial we first had to run our baseline policy, a policy with known performance, a bunch of times, just to make sure the robot wasn't broken and was behaving correctly, and then run our experimental policy a bunch of times. There are also a bunch of domain randomizations we're not sure actually helped, because we added them all at once and performance improved, so we kept them all.
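A sketch of what that trial protocol might look like; `run_episode` and all of the thresholds are assumptions for illustration:

```python
def validated_trial(run_episode, baseline_policy, experimental_policy,
                    n_runs=5, min_baseline_successes=3):
    """Before trusting any experimental result, replay a policy with known
    performance to confirm the hardware itself is healthy.
    `run_episode(policy) -> bool` runs one episode and reports success."""
    baseline_successes = sum(run_episode(baseline_policy) for _ in range(n_runs))
    if baseline_successes < min_baseline_successes:
        raise RuntimeError("baseline underperformed; suspect broken hardware")
    return [run_episode(experimental_policy) for _ in range(n_runs)]
```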
