# Robust Vision-Based State Estimation | Hsiao-Yu 'Fish' Tung | 2018 Summer Intern Open House

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=hyZ1sRfRbgM
- **Date:** 11.09.2018
- **Duration:** 16:46
- **Views:** 5,360

## Description

Hsiao-Yu 'Fish' Tung's presentation about her work during her summer internship at OpenAI. Recorded at the OpenAI 2018 Summer Intern Open House on August 16, 2018.

## Contents

### [0:00](https://www.youtube.com/watch?v=hyZ1sRfRbgM) Introduction

Hi everyone, my name is Fish Tung. I'm currently a PhD student in the Machine Learning Department at Carnegie Mellon, and I'm interning on the Robotics team, working with Peter and Wojciech. Today I'm going to talk about robust vision-based state estimation.

The goal of the Robotics team is to build a very strong robotics foundation that we can use to tackle real AGI. We focus on developing general, learning-based algorithms that work across a diverse set of robotic tasks. To make sure an algorithm we develop is general enough, we need to pick a pretty hard task that really puts it to the test. The task we picked is to have a Shadow Hand rotate an object to a desired pose.

### [1:05](https://www.youtube.com/watch?v=hyZ1sRfRbgM&t=65s) Shadow Hand Robot

The Shadow Hand robot is a robot that looks very much like a human hand: it has five fingers and several joints, and it can reach very delicate poses. The task is particularly difficult because hand-coding this high-dimensional control as a human engineer is nearly impossible.

But before the robot can start moving its fingers to solve the task, the first question we need to answer is: how does the robot know where the object is and how it is posed? In other words, how does the robot know the position and orientation of the object? In robotics and reinforcement learning we call these two quantities the state.

To estimate the state you can use a 3D tracker: you attach markers to the object you are interested in, put the robot in a huge cage, and surround it with a bunch of 3D sensors. These sensors read signals from the markers and tell you the 3D location of each marker. As you can see, this method gives a pretty accurate object state, but it is pretty expensive, and it restricts the robot to only seeing objects inside this cage. Besides, if I bring a new object today, I first need to ask somebody to attach markers for me, which is impractical.

So here we want a more biologically inspired solution: just use cameras. We set up three cameras on the cage, looking at our robot and our object, and from their images we try to infer the state of the object.
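For concreteness, here is a minimal sketch of the target quantity the estimator must produce. The talk only says "position and orientation"; representing the orientation as a unit quaternion is my assumption for illustration.

```python
# A minimal sketch of the object state the estimator predicts.
# The talk specifies "position and orientation"; the quaternion
# parameterization below is an illustrative assumption.
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectState:
    position: np.ndarray     # (3,) x, y, z in the robot's coordinate frame
    orientation: np.ndarray  # (4,) unit quaternion (w, x, y, z)

    def __post_init__(self):
        # Keep the quaternion normalized so regression targets are consistent.
        self.orientation = self.orientation / np.linalg.norm(self.orientation)

# Example: a cube 15 cm above the palm, in its canonical orientation.
state = ObjectState(position=np.array([0.02, -0.01, 0.15]),
                    orientation=np.array([1.0, 0.0, 0.0, 0.0]))
```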

### [3:00](https://www.youtube.com/watch?v=hyZ1sRfRbgM&t=180s) Benefits of Building a Vision-Based State Estimator

There are several benefits to building a vision-based state estimator: it is more flexible, it is cheaper, it is easy to set up, and it is what we really want to deploy on our robot.

To summarize, our environment looks like this: you have the cage with the robot in the center, and you set up cameras looking at your robot and your object. From these cameras you capture three images that view the scene from different points of view. The question we want to answer is how to go from these three images to the state we want to know.

Now let me introduce our method. Our model is a deep neural network. Neural networks are powerful because they generalize across many tasks, achieving state-of-the-art results, and most importantly they do not require the engineer to hand-code features. Before going into how to train this network, let's look at how it is structured. The network takes the three images as input, passes each image through several convolutional layers and fully connected layers, and then aggregates the outputs of these convolutional towers to predict the final object state.

To train such a network, you need two additional things beyond the network itself. The first is a large dataset containing many examples of inputs and outputs; here the inputs are the three images and the output is the state of the object. So how do we get this large dataset? It is essentially impossible to collect in the real world: we can capture the images, but we cannot know where the objects are. To solve this, we use a simulator. In the simulator we build a very similar robot and a very similar object, and we set up three very similar cameras shooting from the right, the top, and the left. From the simulator we can also read off the ground-truth state of the object, so each training example is three images paired with the ground-truth state. And because it is a simulator, we can easily change the texture and color of our robot, the lighting, and the background; doing so gives us a very diverse dataset that covers perhaps every situation the agent might encounter in the real world.

The second thing is an objective function, which here simply minimizes the distance between the prediction and the ground truth. With this large dataset and this objective, we can train the network in the simulator and it works well in the real world: it predicts the state of the object accurately on real images, and we were able to really solve the task with this state estimator.
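As a concrete reference, here is a minimal PyTorch sketch of the estimator as described: one convolutional tower per camera image, aggregation of the tower outputs, and fully connected layers that regress the state, trained by minimizing the squared distance to the simulator's ground truth. The layer sizes, image resolution, and the 7-D state (3-D position plus a quaternion) are illustrative assumptions, not the team's actual architecture.

```python
# Minimal sketch of the described estimator: one conv tower per camera,
# aggregated features, and a fully connected head regressing the state.
# Layer sizes, the 200x200 resolution, and the 7-D state are assumptions.
import torch
import torch.nn as nn

class ConvTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B, 128, 1, 1)
        )

    def forward(self, x):
        return self.net(x).flatten(1)  # (B, 128) global feature per camera

class StateEstimator(nn.Module):
    def __init__(self, num_cameras=3, state_dim=7):
        super().__init__()
        self.towers = nn.ModuleList(ConvTower() for _ in range(num_cameras))
        self.head = nn.Sequential(
            nn.Linear(128 * num_cameras, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, images):  # images: list of (B, 3, H, W), one per camera
        feats = [tower(img) for tower, img in zip(self.towers, images)]
        return self.head(torch.cat(feats, dim=1))

model = StateEstimator()
images = [torch.randn(8, 3, 200, 200) for _ in range(3)]
pred = model(images)                  # (8, 7) predicted object state
target = torch.randn(8, 7)            # ground-truth state from the simulator
loss = ((pred - target) ** 2).mean()  # the distance-minimizing objective
loss.backward()
```

In the described pipeline, the `target` values come from the simulator, with textures, lighting, and backgrounds randomized across examples.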
However, there is still a problem with our current vision system: the simulator requires very carefully aligned cameras. "Aligned" means the cameras in the simulator must have exactly the same positions as the ones in the real world. To obtain these values, an engineer has to go into the simulator and slightly adjust each camera until the images line up. This requires extra effort: every time we change a camera's position or rotation, an engineer has to do it again. Worse, if during test time someone accidentally touches one of the cameras on the cage, the vision system breaks.

To emphasize how serious this problem is, I show the state prediction error on real images in calibrated and uncalibrated environments. The blue curve uses calibrated cameras, the others use misaligned cameras, and the performance degrades a lot. Why does this happen? Because the neural network is simply trying to find some useful pattern and link it directly to the output it is asked to predict, since its only objective is the Euclidean distance between the ground truth and the prediction.

Now sit back and think about how a person would solve this problem. I hold up this hand and ask you: what is the position and orientation of this cube in this 3D coordinate system, given only an image captured from some random camera? What would you do? You would probably say that since the question is about the object, you definitely need to focus on the object. You would probably also want to know where the image was captured from, so you might look at the arm of the robot to figure out the camera position. Combining your object detection and the camera position, you figure out the global state of the object. So to answer this, a human needs geometry knowledge, and a human needs attention on the right objects.

The question, then, is how do we tell the network that these two things are important and encode them in the solution. The answer is that in the simulator we can obtain plenty of 3D information; we can get whatever ground truth we want. So we extract more information from the simulator and add these two ingredients to the neural network. Let me introduce them.

### [9:27](https://www.youtube.com/watch?v=hyZ1sRfRbgM&t=567s) Force the Network To Learn about Geometry

The first is forcing the network to learn about geometry. In the original setup, the network is implicitly trying to infer where each camera is shooting from. So on top of the output of each image tower, I add an auxiliary task: predict the camera's position and orientation as well, to make sure the network really understands where the image was captured from. Adding this auxiliary task improves the prediction of the object's position: the red line is without the auxiliary task and the green line is with it. Adding it forces the network to converge to a more reasonable solution.

The second is casting the right attention for the task. As we said earlier, since we are answering what the state of the object is, we definitely need to look at the object rather than the background, so we circle the object. And since we also want to answer where the camera is, we probably need to focus on the arm of the robot. These are the two things we feel the network should focus on. So here I ask the network to additionally predict a bounding box for each of these two objects. Note that the bounding boxes can be obtained from the simulator; in the simulator we can get whatever ground truth we want. From the bounding-box locations we extract localized features and concatenate these localized features into the final prediction. With this kind of forced attention, the network improves on orientation prediction: the red line is the baseline without any attention, the blue line is with attention on the cube, and the bottom orange line forces attention on both the cube and the arm of the robot.
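Here is a hedged sketch of how the two additions could attach to the per-camera towers from the earlier sketch: an auxiliary head that regresses the camera pose, and a box head whose two predicted boxes (cube and arm) are used to pool localized features for the final prediction. The head sizes, the normalized box parameterization, and the use of torchvision's `roi_align` for the localized features are my assumptions.

```python
# Sketch of the two simulator-supervised additions: an auxiliary camera-pose
# head and forced attention via predicted bounding boxes. Sizes and the
# ROI-align pooling are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class AuxHeads(nn.Module):
    def __init__(self, feat_channels=128, cam_pose_dim=7, num_boxes=2):
        super().__init__()
        # Auxiliary task: regress the camera's position and orientation.
        self.cam_head = nn.Linear(feat_channels, cam_pose_dim)
        # Forced attention: one box for the cube, one for the robot arm.
        self.box_head = nn.Linear(feat_channels, num_boxes * 4)

    def forward(self, fmap, global_feat):
        # fmap: (B, C, h, w) conv feature map; global_feat: (B, C) pooled feature.
        cam_pose = self.cam_head(global_feat)
        b, _, h, w = fmap.shape
        # Predict normalized (x1, y1, x2, y2) boxes, then scale to fmap coords.
        boxes = self.box_head(global_feat).sigmoid().view(b, -1, 4)
        boxes = boxes * torch.tensor([w, h, w, h], dtype=fmap.dtype, device=fmap.device)
        # Pool localized features inside each predicted box.
        idx = torch.arange(b, dtype=fmap.dtype, device=fmap.device)
        idx = idx.repeat_interleave(boxes.shape[1]).unsqueeze(1)
        rois = torch.cat([idx, boxes.reshape(-1, 4)], dim=1)                 # (B*2, 5)
        local = roi_align(fmap, rois, output_size=1).flatten(1).view(b, -1)  # (B, 2C)
        return cam_pose, boxes, local

aux = AuxHeads()
fmap = torch.randn(8, 128, 23, 23)  # feature map from one camera tower
gfeat = fmap.mean(dim=(2, 3))       # (8, 128) pooled global feature
cam_pose, boxes, local = aux(fmap, gfeat)
# `local` would be concatenated into the final state prediction, and training
# would add the auxiliary terms, e.g.:
#   loss = state_loss + w_cam * mse(cam_pose, cam_gt) + w_box * mse(boxes, box_gt)
```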

### [11:44](https://www.youtube.com/watch?v=hyZ1sRfRbgM&t=704s) Conclusion

To conclude, my takeaways: we propose a learning-based vision system that is robust to camera position, and we hope it transfers well to the real setting. In our previous release we proposed domain randomization, where we randomize all kinds of textures, lighting, and backgrounds in the simulator. Here I want to emphasize that, for our visual model to really understand 3D and geometry, we can extract more supervision from the simulator. First, we can enforce geometry learning by adding auxiliary tasks such as predicting the camera pose. Second, we can enforce the right attention by asking the vision system to predict bounding boxes and to use the attended features. Thank you. We have about one minute for Q&A.

Q: [inaudible; about randomizing the camera positions during training]

A: I actually tried randomizing the camera position aggressively, letting the camera move all around the space, but that hurt the prediction a lot overall, even in the simulator. So a naive model that just feeds an image through some convolutions is not going to figure this out on its own. We also tried randomizing the camera positions in our previous release, and it hurt the final performance on real images, so we removed it. The point is that the naive network structure is not good enough: even with heavy camera randomization it cannot learn something useful, so it needs something else to help it.

Q: How do you think your approach would work when the cameras are not right at the hand? You have the extra complexity of depth sensing: differences in the image are not necessarily reflective of real-world differences in distance. Say the cameras enclose the robot hand in the cage, but the object is further away and you want the robot to point to it, so the state would just be the position of the object. You still need to account for that extra distance with the camera's depth perception.

A: If the object follows a distribution that is not the one we trained on, the prediction will definitely diverge; we cannot guarantee anything there. I think that extra complexity can be handled if you have an additional camera looking from the side, so you can make sure that dimension is correct. But if something is very far away, then even a human struggles: if a car is far down the street, you cannot precisely predict the distance between you and the car.

---
*Source: https://ekstraktznaniy.ru/video/11613*