# Social learning in independent multi-agent reinforcement learning | Kamal N’dousse | OpenAI Scholars Demo Day 2020

## Metadata

- **Channel:** OpenAI
- **YouTube:** https://www.youtube.com/watch?v=Qy9J5519s68
- **Date:** 09.07.2020
- **Duration:** 23:05
- **Views:** 4,940

## Description

Learn more: https://openai.com/blog/openai-scholars-2020-final-projects#kamal

## Contents

### [0:00](https://www.youtube.com/watch?v=Qy9J5519s68) Introduction

Hello everyone, I'm excited to be presenting my Scholars project, which focuses on social learning in independent multi-agent reinforcement learning.

### [0:17](https://www.youtube.com/watch?v=Qy9J5519s68&t=17s) Why social learning

My interest in social learning came from reflecting on how it is that I, as a human, have the capacities that I do. If I had happened to be born in the woods, away from all other humans, I would probably have quickly starved to death. But thanks to my ability to tap into cultural knowledge, I have the potential to do all sorts of awesome things, like participate in a space program, or lie in bed all day and browse Twitter. I think if an alien appeared on Earth and saw a single human in isolation, it would be very surprised by the broad variety of behaviors that groups of humans are able to exhibit, or that individual humans can exhibit once they tap into that cultural knowledge. Because of the centrality of social learning to human intelligence, I think it's important to understand the circumstances in which social learning can take place.

To set this up, there's a well-known anecdote from experimental sociology: a group of monkeys was put in a room along with a ladder, and some bananas were suspended from the ceiling so that they could be reached by a monkey who climbed the ladder, but were otherwise inaccessible. ("Really quickly, you weren't sharing your slides. Is that the image you were talking about?" Yes.)

### [2:13](https://www.youtube.com/watch?v=Qy9J5519s68&t=133s) The Monkey Experiment

Apologies, is that working better? Cool. So, there's a group of monkeys in a room, and they can only reach the bananas using a ladder. Any time a monkey climbed the ladder to reach the bananas, the experimenters would spray the rest of the monkeys with cold water. The other monkeys therefore learned that they should beat up any monkey that tried to climb the ladder, in order to prevent themselves from getting sprayed. This behavior persisted even after the monkeys stopped being sprayed with water. Even more interestingly, when new monkeys were introduced into the group after the water spraying had ceased, the new monkeys would of course try to get to the bananas, and the other monkeys would beat them up. So the new monkeys learned not to go for the bananas, but they also learned to punish other monkeys that tried, and this became a cultural phenomenon among the monkeys. As it happens, this experiment is apocryphal and never actually took place, but I think it still serves as an interesting template for how we can try to understand social learning.

### [3:29](https://www.youtube.com/watch?v=Qy9J5519s68&t=209s) The Question

The question I'm interested in answering is whether independent reinforcement learning agents can learn from each other just by virtue of the fact that they exist in the same environment and can observe one another. I think this is an important question because, as reinforcement learning becomes more capable, it seems likely that there will be many environments in which many reinforcement learning agents interact: for instance autonomous and adaptive robots, or agents trading stocks in a market. So it's clearly important to understand the circumstances in which they might learn from one another and exhibit behavior that we might not expect if we were only looking at one of them in isolation.

### [4:17](https://www.youtube.com/watch?v=Qy9J5519s68&t=257s) Outline

I'll break my talk down into two parts. First I'll discuss the tools that I used to approach this question, in particular the environments and reinforcement learning algorithms, and then I'll talk about some actual experiments on learning from experts.

### [4:38](https://www.youtube.com/watch?v=Qy9J5519s68&t=278s) MarlGrid

I developed an open-source grid-world implementation called MarlGrid, which fits the standard OpenAI Gym API. It's easy to extend, so it's easy to put a large number of agents in the environment, and it's very configurable. There are also some registered environments for reproducibility. Given how obscure this domain is, I'm surprised that it's already got a little bit of traction on GitHub. This is an example of the visualizations that I've built: these agents are effectively untrained, but it's easy to include a lot of them in the environment and visualize what each of them is doing.
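As a rough illustration of what a Gym-style multi-agent loop looks like, here is a minimal sketch. The registered environment id and the per-agent observation/action conventions below are assumptions for illustration, not MarlGrid's actual API.

```python
# Sketch of a Gym-style interaction loop with a multi-agent grid world.
# The environment id and the per-agent list conventions are hypothetical.
import gym
import marlgrid  # assumed to register MarlGrid environments with Gym

env = gym.make("MarlGrid-GoalCycle-3Agents-v0")  # hypothetical registered id

obs = env.reset()  # one observation per agent
done = False
while not done:
    # each agent acts independently from its own partial observation
    actions = [env.action_space.sample() for _ in obs]
    obs, rewards, done, info = env.step(actions)
    env.render()  # visualize all agents at once
```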

### [5:31](https://www.youtube.com/watch?v=Qy9J5519s68&t=331s) Goal Cycle

The particular scenario that I spent a lot of time working with I call goal cycle. In this environment there are a number of goal tiles, and agents are rewarded for traversing them in a certain order and penalized any time they mess up that order. One can experiment with this particular environment, the one I'm showing here, by installing the Python package from GitHub. This environment is an analogue to the room with the monkeys: the reinforcement learning agents that exist in it can observe one another and, in principle, interact with one another.
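To make the reward rule concrete, here is a minimal sketch of the logic just described, with hypothetical names and reward magnitudes; it is not the package's actual implementation.

```python
# Sketch of the goal-cycle reward rule: reward for stepping on goal tiles in
# the correct cyclic order, a configurable penalty for stepping out of order.
class GoalCycleReward:
    def __init__(self, num_goals, step_reward=1.0, penalty=-1.0):
        self.num_goals = num_goals
        self.step_reward = step_reward
        self.penalty = penalty
        self.next_goal = 0  # index of the next tile in the cycle

    def on_goal_tile(self, goal_index):
        """Return the reward for an agent stepping on goal tile `goal_index`."""
        if goal_index == self.next_goal:
            # correct tile: reward the agent and advance the cycle
            self.next_goal = (self.next_goal + 1) % self.num_goals
            return self.step_reward
        # wrong tile: apply the configurable penalty
        return self.penalty
```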

### [6:24](https://www.youtube.com/watch?v=Qy9J5519s68&t=384s) Penalty

There are a couple of interesting things about this environment. The penalty is configurable, and changing its value changes the difficulty of learning to explore the environment effectively. When the penalty is low, the agents essentially ignore the penalty incurred by stepping on the tiles out of order: in the video on the left, the agent is not cycling through them in order, and any time it steps on a tile out of order, the tile's color resets to red. When the penalty is very high, exploration is costly because incurring the penalties is aversive, so the agents learn to step on the first tile, where they get a reward, and then just avoid all of them. By controlling the value of this penalty we can change the difficulty of exploration, and in the context of social learning we change the difficulty of learning the effective strategy directly from the environment, as opposed to learning it by observing other agents.
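Continuing the sketch above, the penalty is just a parameter, so the two regimes described here correspond to something like the following (values are illustrative only):

```python
# Illustrative penalty settings: the knob that controls how costly exploration is.
easy_exploration = GoalCycleReward(num_goals=3, penalty=-0.1)   # mistakes are cheap, agents keep exploring
hard_exploration = GoalCycleReward(num_goals=3, penalty=-10.0)  # mistakes are aversive, agents stop exploring
```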

### [7:34](https://www.youtube.com/watch?v=Qy9J5519s68&t=454s) Reinforcements

The other big tool was the reinforcement learning algorithms that I used. I started by implementing DQN, which is pretty standard for this sort of simple environment, but I needed to add memory (an LSTM) in order for the agents to be able to learn strategies that unfold over more than one time step. This didn't work super well, and I spent a lot of effort trying to improve it.
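For context, adding memory to a DQN agent usually means putting a recurrent layer between the observation encoder and the Q-value head, along these lines (a PyTorch sketch, not the project's actual code):

```python
# Sketch of a recurrent Q-network: an LSTM carries memory across time steps,
# so Q-values can depend on more than the current observation.
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden_size=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_size), nn.ReLU())
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); `hidden` carries the LSTM state
        features = self.encoder(obs_seq)
        features, hidden = self.lstm(features, hidden)
        return self.q_head(features), hidden
```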

### [8:04](https://www.youtube.com/watch?v=Qy9J5519s68&t=484s) Improvements

Notably, I implemented prioritized experience replay, which is kind of tricky with the addition of the LSTM, but it still didn't work very well. So I implemented PPO and immediately found a pretty big improvement. Further, I found that carrying over some of the tricks from the R2D2 work, notably refreshing the hidden states that are collected from the environment over the course of the update steps, significantly improved the agents' capacity to use their memories to accomplish tasks. These plots show the difference that it made for a simple goal cycle environment where the agent is learning to traverse the goals: when this trick is applied, the agents are able to achieve much higher rewards and their training is much more stable.
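Here is a minimal sketch of that refreshing trick, assuming a recurrent PPO setup that stores per-episode observation sequences; all names are illustrative rather than the project's actual code.

```python
# After each gradient step, recompute the stored LSTM hidden states with the
# current parameters, so the "memories" stay consistent with the updated policy.
import torch

def refresh_hidden_states(policy, buffer):
    """Recompute stored hidden states under the current policy weights."""
    with torch.no_grad():
        for episode in buffer.episodes:
            hidden = policy.initial_state()
            for t, obs in enumerate(episode.observations):
                episode.hidden_states[t] = hidden           # state before acting at step t
                _, hidden = policy.recurrent_step(obs, hidden)

def ppo_update(policy, optimizer, buffer, num_gradient_steps):
    for _ in range(num_gradient_steps):
        loss = policy.ppo_loss(buffer.sample())             # standard clipped PPO loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        refresh_hidden_states(policy, buffer)               # keep hidden states from going stale
```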

### [9:10](https://www.youtube.com/watch?v=Qy9J5519s68&t=550s) Recap

To recap, a large part of the effort of the project went into developing the reinforcement learning algorithms and environments that allow agents to effectively learn tasks that are amenable to the kind of experiments I'll discuss next.

### [9:31](https://www.youtube.com/watch?v=Qy9J5519s68&t=571s) Revisiting the question

Revisiting the original question, I'm interested in knowing when independent agents can learn from experts to accomplish tasks, or can acquire skills from experts. What this might look like is: we have a bunch of experts who have a high level of skill, and a novice who is introduced to the environment, who initially is unskillful but is able to get to the point of expertise just by observing the experts. We'd also want it to be the case that if the novice were alone, it would be unable to learn and its skill would remain low.

### [10:15](https://www.youtube.com/watch?v=Qy9J5519s68&t=615s) The paper

There is a paper that addresses a question like this, called "Observational Learning by Reinforcement Learning" by Borsa et al. from DeepMind. In their paper the experts are hard-coded, and the novices use RL to accomplish a task in a simple grid world. The diagram on the top shows a bird's-eye view of the map: the expert, in blue, travels optimally to a goal which, at each episode, is placed randomly at one of sixteen positions, and the novice needs to learn to get to the goal as well. Here's a video of that. They found that the experts help the novices learn more quickly, but the presence of the experts doesn't cause the novices to do any better ultimately than they would if they were learning alone.

### [11:21](https://www.youtube.com/watch?v=Qy9J5519s68&t=681s) The takeaway

I started by trying to replicate the first finding in a simple cluttered grid world, which is like the goal cycle grid worlds I showed earlier but with only one goal, and found very convincingly that the presence of experts didn't help the novice agents learn to accomplish their task any more quickly. The takeaway here is that it's hard to learn from social cues in these environments, but that doesn't prove that it's impossible.

### [11:59](https://www.youtube.com/watch?v=Qy9J5519s68&t=719s) The goal

In order to look in a more targeted way for the circumstances in which this might happen, my effort shifted to different environments, in particular the goal cycle environment. The goal of my experiments has been to construct a scenario where, in contrast to the Borsa results, novices and experts are the same sort of agent (both trained by reinforcement learning), where solitary novices struggle to learn, and where the presence of experts helps. Ideally we'd want the novices to be able to themselves become experts, so that we can see that they have mastered the skill. As a bonus, whereas in the Borsa case there's not all that much information the novices can get from the experts, because the goal is only ever in one of sixteen places and the novices could simply memorize the potential places, we want something that looks a bit more like a skill. We get this in the goal cycle environment, because the process of spawning in a new environment and trying out the different possible cycles until identifying the correct one is a closer analogue to a skill than just being cued as to which quadrant the goal is in.

I found that when the goal cycles are masked from the view of novice agents, novices do in fact learn to follow experts, which is consistent with the results from Borsa. Both of these videos exhibit this behavior: the novices are shown at the bottom of the columns on the right, and in both cases they are doing a really robust kind of following behavior. Here, one of the experts happens to have spawned in a trap. In these cases, because the novices are just following the experts, they end up converging to slightly lower performance than the experts, as you can see in this graph.
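A rough sketch of the masking idea: novice agents receive observations in which the goal tiles are hidden, so the only way for them to find the correct cycle is to watch the experts. The wrapper interface and tile encoding below are assumptions for illustration, not the project's actual code.

```python
# Hide goal tiles from novice agents' observations so the cycle can only be
# learned socially. Tile encodings and agent-id handling are hypothetical.
import gym
import numpy as np

class MaskGoalsForNovices(gym.ObservationWrapper):
    def __init__(self, env, novice_ids, goal_value, floor_value):
        super().__init__(env)
        self.novice_ids = set(novice_ids)  # which agents are novices
        self.goal_value = goal_value       # how goal tiles appear in the obs grid
        self.floor_value = floor_value     # what novices see instead

    def observation(self, obs_per_agent):
        masked = []
        for agent_id, obs in enumerate(obs_per_agent):
            if agent_id in self.novice_ids:
                obs = np.where(obs == self.goal_value, self.floor_value, obs)
            masked.append(obs)
        return masked
```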

### [14:29](https://www.youtube.com/watch?v=Qy9J5519s68&t=869s) The next steps

So far, the conclusion I've drawn is that it's very hard to learn from experts, and when it's possible to acquire a skill directly from the environment, agents are likely to do that. The next steps for this project, which I'll continue working on, focus on trying to create environments where the information available from the experts is a more valuable cue as to how to obtain a high reward than the information available directly from the environment. To that end, I plan to increase the number of goals, experiment with different penalty values, and so on.

Also, in the example that I showed, the following behavior, while it does help the agent accrue more reward, isn't quite the same skill that the experts are showing. Going back to the monkey analogy, we want the novice agents to be doing the same thing that the experts are doing, exhibiting the same skillful behavior, and a better way to measure that would be to look at the performance of the agents when they're moved to a new environment without experts. Another approach is to add mechanisms that encourage agents to learn socially. It's not clear, for instance, to what degree humans are social learners because they're biologically predisposed to be, as opposed to because of the environments they're in, though by comparing to animals we might expect the former. We can similarly introduce such priors into agents, and then characterize the emergence of social behavior by varying or turning down that prior.

### [16:37](https://www.youtube.com/watch?v=Qy9J5519s68&t=997s) Thank you

I'd like to thank my mentor Natasha, who's been incredibly supportive and helpful, both in helping me make the best use of learning resources and in helping me engage with the broader research community. I'd like to thank the program coordinators, Mariah and Kristina, for helping the program run smoothly even in light of the pandemic. I'd like to thank my fellow scholars for a lot of incredibly informative discussions and for generally being extremely supportive. Special shout-outs to Weights & Biases for helping me keep track of my experiments, and also to Alethea Power for lending me a graphics card that I've been using for some of these experiments.

### [17:28](https://www.youtube.com/watch?v=Qy9J5519s68&t=1048s) Questions

I have time for some questions.

The first question: can a novice become more expert than an expert, such that other experts learn from it? That's a great question. In the experiments I've been doing, the experts continue to learn alongside the novices. Here, for instance, in this plot the experts are still learning, but because in this environment they happen to be close to optimal, we don't see much change as they continue to adapt. But in principle, yes, this could happen. I think another interesting direction for understanding social behavior in independent multi-agent reinforcement learning is to carefully study the impact of just learning in a group, which is kind of similar to that.

Another question: could you elaborate on hidden state refreshing in your agent? When do you refresh the hidden state, and how does it differ from the R2D2 approach? So, I trained the agents with PPO, and PPO agents alternate between collecting experience in an environment and updating based on that experience. During the update phase, the agents sample their experience and perform a bunch of small updates based on that batch of experience before discarding it at the end of the update. In typical PPO LSTM implementations, the agents save their hidden states as they interact with the environment, which is like remembering what was in their mind alongside the experiences, and then they sample those hidden states as they perform each of these little updates. But the hidden states depend on the parameter values at the time the experience was collected, so as the parameters are updated, the stored hidden states become less and less representative: there's a growing divergence between the data and the current values of the parameters. I found that it wasn't too costly to refresh them, and I have some tweaks to my LSTM implementation that facilitate this, so I end up refreshing them basically between each iteration, each gradient step. The R2D2 approach differs in a few ways, mainly, I think, because R2D2 is off-policy, so the volume of experience that can go into each update is much larger. Because of this, they need to employ some tricks to make sure that the hidden states don't get too stale without refreshing them between each iteration, since that would be very costly. For PPO and on-policy reinforcement learning, it didn't matter too much.

Another question: why do you think that proximal policy optimization worked so well? That's a good question. I've been thinking a bunch about this, and I think a lot of it in practice comes from the fact that my implementation of PPO is based on the Spinning Up implementation (so Spinning Up also deserves a shout-out), and it inherited a lot of tweaks that help the agent learn stably and perform well. So I hesitate to say that PPO is inherently better than DQN; that's certainly been my experience, but I think I inherited a lot of improvements from the implementation I based mine on. And then the hidden state refreshing, I think, is interesting: it helped immensely with robustness, and I think the reason is that it prevents the policy from making big changes over the course of each update, which helps ensure that the policy stays consistent with the data it's learning from. I'd be interested in some clarification on that question, but...

---
*Source: https://ekstraktznaniy.ru/video/11593*