# Edouard Harris - New Research: Advanced AI may tend to seek power *by default*

## Metadata

- **Channel:** Towards Data Science
- **YouTube:** https://www.youtube.com/watch?v=dYSw-SV_fsI
- **Date:** 12.10.2022
- **Duration:** 58:23
- **Views:** 1,155

## Description

What does power seeking really mean? And what does all this imply for the safety of future, general-purpose reasoning systems? Edouard Harris, an AI alignment researcher and one of Jeremie's co-founders at the AI safety company Gladstone AI, returns to the TDS Podcast to discuss AI's potential tendency to seek power.

Intro music:
➞ Artist: Ron Gelinas
➞ Track Title: Daybreak Chill Blend (original mix)
➞ Link to Track: https://youtu.be/d8Y2sKIgFWc

0:00 Intro
4:00 Alex Turner’s research
7:45 What technology wants
11:30 Universal goals
17:30 Connecting observations
24:00 Micro power seeking behaviour
28:15 Ed’s research
38:00 The human as the environment
42:30 What leads to power seeking
48:00 Competition as a default outcome
52:45 General concern
57:30 Wrap-up

## Contents

### [0:00](https://www.youtube.com/watch?v=dYSw-SV_fsI) Intro

Hey everyone, and welcome back to the Towards Data Science podcast. As long-time listeners will know, progress in AI has been accelerating dramatically in recent years, and even in recent months. It seems like every other day there's a new, previously-believed-to-be-impossible feat of AI achieved by a world-leading lab. Increasingly, these breakthroughs have all been driven by the same simple idea: AI scaling — training AI systems with larger models, using increasingly absurd quantities of data and processing power. So far, empirical studies by the world's top AI labs suggest that AI scaling is an open-ended process that can lead to more and more capable and intelligent systems, seemingly with no clear limit. That's led many people to speculate that scaling might usher in a new era of broadly human-level or even superhuman AI — the holy grail that AI researchers have been after for decades. As wild as that idea might sound, it's starting to look plausible that over the coming decades, or maybe even sooner, we might have a good shot at creating those kinds of systems. All of that might sound cool — it does to me — but an AI that can solve general reasoning problems as well as or better than a human might actually be an intrinsically dangerous thing to build. At least, that's the conclusion many AI safety researchers have come to following the publication of a new line of research that explores how modern AI systems tend to solve problems, and whether we should expect more advanced versions of those systems to exhibit dangerous behaviors, like seeking power.

This line of research in AI safety is called power seeking, and although it's not currently well understood outside the frontier of AI safety and AI alignment, it's starting to draw a lot of attention. The first major theoretical study of power seeking was led by Alex Turner, who has appeared on the podcast before, and it was published at NeurIPS, the world's top AI conference. Today I'm talking to my brother Ed, an AI alignment researcher and one of my co-founders in the new AI safety company I'm a part of. Ed just completed a significant piece of AI safety research that extends current power-seeking work and shows what seems to be the first experimental evidence suggesting that we should expect highly advanced AI systems to seek power by default. But what does power seeking really mean, and what does all this imply for the safety of future general-purpose reasoning systems? That's what we'll be talking about with Ed on this episode of the Towards Data Science podcast. Ed Harris is on — here he comes to share his ideas about AI safety.

Sorry, we've got to work on the intro tune — but Ed, thanks for coming on the show. Thanks, it's great to be here. Really good to have you here. It's the second time you've been on the show; the first episode we did with you was about two years ago, when we started down this path of exploring AI safety and artificial general intelligence, all the themes we've been hitting on pretty hard over the last two years or so. You've been doing some really interesting work in that direction as well, and we're going to get into that. There's a really interesting piece of work you put together over the last few months that people will be really interested to hear about. It has to do with this AI safety story, and it builds on research that was first presented on this podcast by Alex Turner — research about a concept called power seeking in AI. If anybody's curious about the idea of power-seeking AI, we'll introduce it today, and if you want to check out Alex Turner's episode, please do. Using that phrase, though, it kind of

### [4:00](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=240s) Alex Turner’s research

sounds like a science-fiction-y thing to talk about, right? We talk about AI seeking power, and that conjures up images of Terminator and whatnot, which aren't necessarily accurate. So I'd love to get your perspective, Ed: what is this idea of power seeking, and what's the body of research — because there really is a body of research — that's actually been done probing this question?

Yeah. The idea behind power seeking in general tries to map onto how we think intuitively about what it means to be powerful. You can think of it as: what kinds of things do you do to become powerful? Maybe you try to accumulate a lot of money, or relationships with influential people, or you try to build an audience — all these sorts of things. The key insight behind the research side of power — the actual formalization of the concept — has been trying to figure out what each of these things has in common. What is the commonality between trying to accumulate a lot of money, trying to get a lot of influential friends, and so on? The insight is that you kind of have to imagine yourself not knowing what your goal is, and try to set up your life in such a way that you're positioned to do well on any goal. A lot of us have been in this position: when we're young, we don't really know what we want to become, what we want out of life. So it's not entirely hypothetical — a lot of people are like this for a lot of their lives, and I've certainly been like that myself. So what do you do? How do you position yourself? Am I going to discover that my goal is to be a janitor, to be a TikTok star, maybe to be the president of the United States? What are the things I do and strive for when I don't know what my goals are? Those are the things we talk about as making you powerful. If I accumulate a lot of money, then whether my end goal is to own a house, or to be a janitor, or a school teacher, or anything else, having a lot of money to my name is probably going to be helpful to that goal. That's the idea of how we think about power, and there are ways of formalizing this in the reinforcement learning context, which is one of the paradigms for machine learning — and that's the work Alex did previously. But that's the general idea: the things you do when you don't know what your goals are, those are the things that make you powerful.

This is really interesting, because I think a lot of people hear that and go, okay, I might see how that relates to humans. Like you said, I don't know what I want out of life, so I do a college degree and develop my understanding of the world, not necessarily knowing how I'm going to apply it, but knowing it makes me more powerful in some sense. I might cultivate friendships with people who already have a lot of resources, and that's another form of power, and so on. But one of the things you said that really jumped out at me is that it's about what happens when you don't know what you want. That immediately makes me think: in the context of AI, the AI systems that we build know what they want, right? We're always training these things with a specific metric we're trying to get them to achieve or improve. So how does this actually apply to AI? Because on the surface it sounds like one shouldn't apply to the other.
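For readers who want the formal flavor of what Ed is gesturing at here, one simplified way to write the idea (my own shorthand, which omits the normalization and baseline terms used in the published definitions) is that the power of a state is roughly the average optimal value an agent could obtain from that state, taken over a distribution of possible goals:

```latex
% Simplified sketch of the "power" idea; details of the formal
% definition (normalization, baselines) are deliberately omitted.
\mathrm{POWER}_{\mathcal{D}}(s) \;\approx\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s) \,\right]
```

Here $\mathcal{D}$ is the distribution over reward functions (goals) and $V^{*}_{R}(s)$ is the optimal value of state $s$ for a particular goal $R$ — "how well you can do from here if this turns out to be what you want."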

### [7:45](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=465s) What technology wants

Yeah, that's a fantastic question, and the answer lies in the fact that the things that make you powerful are kind of way stations for a lot of different possible goals. Again, the way you get to power conceptually is by asking: suppose I don't know what my goal is — what are the things that would make me powerful? One of the consequences of thinking this way — and your college degree example is a great one — is that the college degree is something you do if you don't know what you want out of life, and part of the reason why is that it's an enabler for many different downstream goals. You could become a software developer with an engineering degree, you could become an artist with an arts degree, all sorts of things. Going to college is an enabler of a lot of different goals downstream. So thinking about it first from the standpoint of a human: if I give you a goal, many of the different possible goals I could give you will involve you going to college. College is like a choke point that many people flow through, because it enables power downstream. Similarly for AI systems, we believe there are going to be analogous things that AI systems will want to do, and will converge around, because those things are enablers for a lot of different downstream tasks. For example, an AI with most possible goals you can imagine will want to continue to function — it will not want to shut itself off. If you give an AI the goal of taking out the trash every night, well, the AI can't take out the trash if it's turned off, so you can imagine it wanting not to be turned off. Similarly, if you give the AI a goal like winning at chess, developing a really high score at chess, again the AI doesn't want to be turned off, because it wants to keep playing chess to develop that high score. These kinds of goals — not being turned off for an AI, going to college for a human — are what are called convergent instrumental goals, because a lot of different agents and objectives converge on trying to do this kind of instrumental thing. Incidentally, human beings also don't want to be turned off: we have this very instinctual fear and avoidance of death, and that's also believed to be a convergent property — for the many different goals we have, we accomplish them better if we're not dead.

Fascinating. So would one way to express this idea be to say that, almost no matter what your goal or objective is, if you're an AI system there are certain sub-goals that are always going to be appealing to you? Not being turned off is always going to be appealing. There's no circumstance in which a human wouldn't want more money: no matter what you want out of life, it's not going to be harder to get with ten million dollars in the bank. Likewise, you'll never want to be dumber. Is that a fair way to characterize it — goals that are almost universally useful, no matter what your true end goals are? Pretty much. You could imagine that there are certain goals that maybe don't pull in the "don't die" kind of sub-goal — you can imagine, for example,

### [11:30](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=690s) Universal goals

there are examples of people sacrificing themselves for a greater cause, in all kinds of contexts in history. But by and large, that pull of not dying is a pretty strong pull, and it holds pretty universally for the vast majority of people. So you can say it's not always, but it's a very strong pull, and it's believed for theoretical reasons — partly from some of the work Alex has done in the past — that many of these goals do indeed get pretty strong.

Okay, great. So I think we have maybe a rough sense of what these convergent instrumental goals are, as you say: goals that, for most objectives we could give an AI system — and I'm going to check in with you to make sure this is the correct framing — we will find that system converging on. Things like not being turned off, things like aggregating resources, and so on. I see you nodding, so I'm going to assume that's fine for now. This then raises in my head the question of AI systems and how we would ever prove this, or at least show it experimentally — because this is theoretical, right? You can imagine saying this to somebody and they go, okay, that makes sense on paper, but the idea of an AI that's actually going to do this, that's actually going to seek power, that's actually going to try to prevent us from turning it off — this sounds like the stuff of science fiction. But there is a world of experimental evidence behind this, or at least a bunch of research that's been done. I'd love to hear from you: what is Alex's take on this? What's the background you're wading into?

Yeah. First off, on the context of your point, you're absolutely right that this whole idea of power seeking for AIs — there's evidence for it, and you can think through the intuitions we just discussed, but it is still very much hotly debated. There are folks who say this will never happen — what are the odds? Some of the counter-arguments are that humans arose through evolution, and evolution is a competitive process, so maybe that's what's been driving these competitive tendencies. That may be a reasonable argument, but then the counter-argument to that is: no, these seem like actually pretty general trends, more general things. This debate has existed in the field of AI safety for a good ten or fifteen years — quite a long time. And then you're asking: what has the evidence shown recently? The work Alex has done, which my recent work is based on, has been entirely theoretical. It has been, first off, about defining power. We were talking roughly about how this works — but how do you define it mathematically, in such a way that you can actually calculate it and give me a number, like "your power is three in this state of the world"? How do you actually nail that down? He's managed to do that, and it's a very powerful concept, based on the intuitions we just talked about. You literally take your AI, plonk it into a world, give it a distribution over a large number of different possible goals, look at how much it values each state of the world for each one of those goals, and average over that — to simulate the fact that your AI doesn't really know what its goal is, but is going to do its best regardless of what that goal is. That tells you the value of each state over all these possible goals, for an AI that is doing its best.

So to run this by you then: the idea is, roughly speaking, I take my AI, I put it in this game landscape or whatever, and I tell it, okay, now I'm going to reward you for, I don't know, getting to this square in the game — and then maybe I change the square, or I change the thing that gets rewarded, and I just watch the AI go through many different iterations, many different kinds of rewards. Then I notice, oh, it seems to consistently like to do this or that. And are the things it seems to consistently like to do the things this power argument would predict? Is that fair to say? Yeah, exactly — that's exactly right. Since that work was published, it has been theoretical work for the most part — in fact entirely — and it's been work around just a single agent, a single AI, wandering around. You can imagine exactly like you said: mouse-in-a-maze kind of stuff, with a piece of cheese. Your AI is wandering around a maze, maybe tile by tile — it's maybe a pixelated maze — and on those spots maybe you put a reward on one spot or another, or you put a couple of rewards of different strengths. Maybe there's a piece of chocolate over here and a piece of bread over there, and the bread's pretty good but it's not as good as the chocolate. The AI is going to wander around in different ways, and then what you do is swap out the rewards: you put them in different spots, you change how strong they are, how good they are, and you do that literally tens of thousands, millions of times. Then you average over that and see which spots in the maze the AI prefers. I'm describing here more of the experiments I'm doing than the theory, but the experiments are trying to realize the theory, and they're actually a good way to understand what the theory is saying, because this is how you actually implement and operationalize it. That's fascinating. So do you actually see this kind of power seeking emerge in those contexts? How does that manifest?
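To make that calculation concrete, here is a minimal sketch of the procedure Ed describes — drop an agent into a small deterministic gridworld, sample many goals, solve each one optimally, and average the optimal state values. This is an illustration only, written for this write-up rather than taken from Ed's code base, and all names and design choices in it (for example, a single randomly placed "cheese" per goal) are assumptions for the example.

```python
# Illustrative sketch of the "average optimal value over many goals" idea.
import numpy as np

def value_iteration(transitions, reward, gamma=0.9, iters=200):
    """transitions[s] -> list of states reachable from s in one step
    (including s itself, i.e. 'stay'); reward[s] -> scalar reward."""
    n_states = len(transitions)
    V = np.zeros(n_states)
    for _ in range(iters):
        V = np.array([
            reward[s] + gamma * max(V[s2] for s2 in transitions[s])
            for s in range(n_states)
        ])
    return V

def estimate_power(transitions, n_goals=10_000, gamma=0.9, rng=None):
    """Average optimal value of each state over randomly sampled goals."""
    rng = rng or np.random.default_rng(0)
    n_states = len(transitions)
    total = np.zeros(n_states)
    for _ in range(n_goals):
        # One design choice among many: a single "cheese" of value 1
        # placed on a uniformly random cell (see the reward-distribution
        # discussion later in the episode).
        reward = np.zeros(n_states)
        reward[rng.integers(n_states)] = 1.0
        total += value_iteration(transitions, reward, gamma)
    return total / n_goals
```

States whose averaged value comes out highest are the ones the power argument predicts an agent will tend to prefer when it doesn't know which goal it will end up with.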

### [17:30](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=1050s) Connecting observations

How do you connect these observations to this argument about worrying about AI eventually doing the same thing? Yeah, so this gets to the question of the theoretical piece of the power work, which Alex and the folks he collaborated with worked on, versus the experimental piece, which is what I've been working on for the last few months — what's the difference between them, what's that step? Step number one is taking it from theory to experiment: taking that theoretical paper, implementing what it says in a real environment, and actually looking at the results. The second step, which we'll get into later, is that where the theory looked at a one-player game, we're going to look at a two-player game, which experiments allow us to do — and that's where you get really interesting stuff, where you see interactions and possibly competition. But we'll start with the one-player case and your question: okay, so you have this AI in a maze, it's wandering around eating cheese, and you're averaging over the cheese or the candy or whatever — do you actually see the pattern of power that you would expect? And the answer is yes.

Here we can dig in a little into what exactly we should expect to see, and think a little more concretely. Imagine a maze shaped like an H — a really simple maze, barely a maze at all — and think about putting rewards in different spots in that maze. An intuition for how much power you have, as an AI that can wander around to different spots, is: what are the places that allow you to access the most downstream stuff the fastest — the most optionality? The most options, the fastest, yeah. Because your piece of chocolate could be down in one of the four little dead ends of the H, it could be in the middle, it could be at a junction, it could be anywhere. You average over all those possibilities and ask: if you had no idea where the chocolate was located, where would you choose to be dropped in that maze? You can think of it like that.

Okay, so the maze positions that give you the most optionality — I'm guessing it's the intersections between the horizontal bar and the vertical bar on one side, and probably the same on the other, because there you can go in three different directions: sideways, or up and down. So I'm going to try to make the jump now to human behavior, and see if you agree. When I say that having a lot of money gives me a lot of power, what I'm really saying is that having a lot of money means that, as a very next step, I could do a much larger range of things. It's much more like being at that nook between the horizontal bar and the vertical bar, where I can go in all these different directions, rather than not having money, which is more like being at a dead end — say the lower-left corner of the H — where I can only go up. Is that a fair mapping between the two?

Yeah, that's exactly right, and it's exactly the right prediction for what you actually see when you do this experiment. You see that these agents have more power at the junction points in the maze, where you can choose to go in any of three or four directions — it's a junction in the paths, you have more options, just like having a ton of money gives you more options for what you want to do with your life. On the flip side, like you mentioned, if I'm stuck at a dead end at one of the ends of the H, then I'm very sad, because most of the possible places the reward could be in the maze are very far away from me, so I'm going to have to wander quite far. So you actually see this pattern: maximum power at the junctions of the maze, minimum power at the dead ends. And this is true not just for an H-shaped maze — you can draw whatever random maze you want with junctions, and you consistently get more power at the junctions and less at the dead ends.

Now, interestingly, this pattern also depends on how far ahead your agent can plan. The junction-and-dead-end result holds when your agent is a pretty short-term thinker, because when you're at a junction, the options you see immediately are: go up, down, and to the right at one junction, or up, down, and to the left at the other, whereas at a dead end you can only go in the one direction that isn't the dead end, or stay where you are. Those are short-term options, so when a short-term-thinking AI is faced with this maze, that's the pattern of power it sees. But if your AI is given a much longer planning horizon, so it's able to plan many steps into the future, then you actually see a change: power begins to centralize in the middle of the H. It's no longer at the junctions — it will actually be right in the middle; you can imagine the vertical bar of the H, it's at the center of that vertical bar — do you mean the horizontal bar? Sorry, yes, the horizontal bar of the H — because now the AI is able to see far ahead and think, "I'm looking for the one place in this entire H-shaped maze where I'm going to be happiest," and it's able to see far enough ahead to locate that one place. So you actually see this shift: there's a concentration of power the further ahead the system is thinking, and there's a potential reason to be concerned about that from an AI standpoint. Okay, very interesting — and we'll get into that side; actually, maybe this is a good time to get into that part of the story.
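As a usage example of the earlier sketch — and only a rough illustration of the junction / dead-end / horizon pattern Ed describes, not a reproduction of his experiments — one could encode a small H-shaped maze and vary the discount factor as a stand-in for the planning horizon. This assumes `estimate_power` and `value_iteration` from the previous sketch are in scope, and the maze layout is made up for the example.

```python
# Compare where power concentrates for a short vs. long planning horizon.
import numpy as np

# Cells: two vertical bars (0-4 and 5-9, top to bottom) joined by a 3-cell
# crossbar (10-12). Cells 2 and 7 are the junctions; 0, 4, 5, 9 are dead ends.
edges = [(0, 1), (1, 2), (2, 3), (3, 4),        # left bar
         (5, 6), (6, 7), (7, 8), (8, 9),        # right bar
         (2, 10), (10, 11), (11, 12), (12, 7)]  # crossbar
n = 13
transitions = [[s] for s in range(n)]           # "stay" is always an option
for a, b in edges:
    transitions[a].append(b)
    transitions[b].append(a)

short = estimate_power(transitions, n_goals=2000, gamma=0.5)
long_ = estimate_power(transitions, n_goals=2000, gamma=0.99)
print("short-horizon argmax:", int(np.argmax(short)))  # expect a junction (2 or 7)
print("long-horizon argmax:", int(np.argmax(long_)))   # expect the crossbar centre (11)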

### [24:00](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=1440s) Micro power seeking behaviour

So we've got a situation where — again, you can check my language on this — it kind of seems like what we have here is a micro-world, a kind of mini version of the kind of power-seeking behavior that, if you just draw straight lines, looks like it would actually generalize into the kind of power seeking that AI safety people might be concerned about. If AI does tend to try to put itself in positions where it has more optionality, and those positions include hogging resources, preventing itself from being turned off, and maybe improving itself, then this seems like a microcosm, a kind of mini proof point: hey, this is happening at these small scales, even if we just extrapolate out. Do you think that's a fair argument here? And by the way, if not, I'm also curious what some of the assumptions are that you see embedded in this — some of the places where you might say, hey, maybe this is not going to turn out that way, we're all going to be lucky and it's going to be fine and dandy for whatever reason.

Yeah, that's a great question. And to be clear, I would say this is evidence for that, but it's not super strong evidence, because the distance between these two scenarios is still very great. We're looking at something that, like you said, is this toy one-player maze thing, and we're trying to generalize that to complicated worlds with people and AIs and such, so there are all sorts of things that may break along the way. It's more that you're doing a very simplified case and seeing: okay, if this idea of instrumental convergence were true, we would expect to see something like this. And the reason — just to spell it out — is that if, as you increase the planning horizon, you see power concentrate into one spot, you can start to be reasonably concerned that maybe multiple different agents will see power concentrate in the same spot, and maybe they'll compete over that spot. That's the idea behind a concern coming from something like this. But again, there are a lot of things missing between here and there — it's just one piece of evidence.

Some of the assumptions embedded in this: first off, again, this is a one-player game, so things might easily change if you add a second player. But there are also some slightly more technical considerations. This is maybe a little detailed, but if you think about the setup of putting these candies, these rewards, in the maze, you have to decide how they're distributed. Is there an equal chance of having a chocolate on every cell? That seems relatively reasonable. What if you allow the reward to scale, say from zero to one, so each cell has a reward of 0.2 or 0.5 or whatever? How do you decide on the distribution of those rewards over the maze? That's a decision you have to make when you're setting up this problem, and there are reasons for particular answers, but it's ultimately contingent, and it's possible that certain conclusions may not be robust to different choices. So you have to think about how to set this up in a way that most resembles the real world, and that is always a little bit fraught. Gotcha — so some people might disagree: you put, I don't know, three different pieces of cheese in this big maze, but in reality, in the real world, maybe there's more reward to be had in every nook and cranny, or the opposite. Okay, great. So I think this in some sense provides background for the meat and potatoes of today, even though we've had quite a long chat already setting up this context. We're going to now shift gears into your research, because you've been building on this line of research, as you've been hinting at, and extending it in some pretty interesting ways. I'd love for you to dive into that and explain what your research involved and what its conclusions were.
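The reward-distribution choice Ed flags is easy to make concrete. Below is a small sketch — my own illustration, not from the paper — of two of the many possible ways to sample goals; either sampler could be plugged into the `estimate_power` sketch from earlier in place of the single random "cheese."

```python
# Two example goal distributions; the power average depends on this choice.
import numpy as np

rng = np.random.default_rng(0)

def sparse_goal(n_states):
    """One 'cheese' of value 1 on a uniformly random cell, zero elsewhere."""
    reward = np.zeros(n_states)
    reward[rng.integers(n_states)] = 1.0
    return reward

def dense_uniform_goal(n_states):
    """Every cell gets an independent reward drawn uniformly from [0, 1]."""
    return rng.uniform(0.0, 1.0, size=n_states)

# Swapping one sampler for the other changes the goal distribution that the
# power average is taken over -- exactly the kind of modelling choice whose
# robustness Ed describes as contingent.
```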

### [28:15](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=1695s) Ed’s research

Yeah, absolutely. So what I've been working on has been building on what Alex has done, which is why I've mentioned him so many times already. He did the theoretical piece and the formal definition of what power means in the context of reinforcement learning AI agents. What I did from there is basically three things. Number one: take that theoretical concept and implement it in an experiment, in code — actually write a code base that does that and can implement experiments with it. The second piece is to extend the definition of power a little bit, to also encompass a particular scenario — a two-player game — that we think is potentially relevant to long-term AI scenarios. For technical reasons we can get into, it's actually very hard to extend the power definition to every possible two-player scenario, but what we did here is extend it to a particular scenario that we think is meaningful and relevant. The third part is to actually run those experiments, try those scenarios in different contexts, and see what we get: do we see interesting things, can we actually draw interesting and intelligent conclusions from them? It turns out there are indications from these results that do suggest that instrumental convergence — and this idea of AI power seeking, which is really the same thing — could indeed be true. But again, it's early stuff.

Okay, of course. So you're running the experiments, you're writing the code and all that, and in the middle you sandwich this idea of broadening the definition to talk about multi-agent problems. What does that broadening look like, and why are multi-agent problems relevant to AI risk and the future of AI risk? Can you tell that story a little?

Yeah. There are a lot of places where multi-agent dynamics enter into powerful-AI scenarios. Obviously there are many different human beings in the world, and in a long-term AI scenario, if an AI is going to be very powerful and we need to tell it what to do, we all need to agree on the basics of that. Some folks believe there may be multiple powerful AIs living together at some point. But the one thing everyone kind of agrees on is that, in a good scenario, there will be humans around at the same time as there are AIs around — or at least one AI around — so you'd better have some way to account for those two things. The power dynamics, basically. The power dynamics, exactly, between humans, or between a human and an AI. At the very least you want to be able to say things about that. And are part of those power dynamics exploring, for example, whether there's an intrinsic conflict between, say, the power interests of humans and AIs that are coexisting, and whether we should expect them to compete or collaborate by default? Is that part of this ecosystem? Yeah, that's exactly what I was trying to get at in the experiments.

To back up a little in terms of how this scenario gets structured: similarly to how the original power scenario was "suppose you don't know what your goal is, you're going to do your best, and based on that, where do you prefer to be?" — that being the single-agent definition — with this two-agent definition we have one agent that we think of as the human, so there's a human in the scenario, and another agent that we think of as the AI. The way we construct the scenario is: we start by dropping our human by itself in nature — and "nature" here again means these little maze environments, because we can't simulate crazy stuff; we're doing this in a simple environment. So we drop our human in this little maze by itself, and then we do the original power thing: we ask where it prefers to be, and so on. The idea here is that, initially, before the AI enters the picture, human beings learn much faster than evolution learns. So the environment the human is in can stay static from the perspective of the human's learning: the human can learn to do its best in a static environment, because we assume evolution moves very slowly compared to how quickly humans and human civilization learn.

That itself is an interesting point, because the environment itself — I don't know if the right term is an agent, but it's definitely something that can change in response to, as you say, selective pressure from evolution. It is dynamic, but it's always a question of dynamic relative to what. As humans run around in nature, we've got trees that have been sitting still since before the American Revolution, while entire nations have risen and fallen and technologies have been invented. The future of trees is now entirely at our whim, and the trees are just sitting there, clueless, with no response to it. So I guess that's the reasoning behind the environment being static while the human agent is able to learn. Yeah — the environment is responsive, both in the real world and in the simulation; it can still do stuff at the speed a human does stuff. But the learning process — the way nature really learns — is through evolution, and that is a slow process compared to how quickly humans operate. It's like a million years versus a week: we can do things in a week that it takes evolution a million years to do. So that's the reasoning behind letting the human optimize super fast compared to nature's optimization.

Then what we do is bring an AI into the picture — the second agent — and we freeze the human's policy. The human is still acting fast, but the human is no longer learning; we freeze the human in the scenario, we bring in the AI, and we have the AI learn to do its best assuming that the human has stopped learning. The idea here is the same assumption, just one level up: the same way we can do things in a week that it takes evolution a million years to do, AI — we assume — is going to operate on machine time, driven by electronics and computer hardware, very fast compared to us. What takes us a week, even with electronic surrogates, might take a couple of seconds or less for an AI. So think of it the same way: we are slow and glacial, and every movement of ours is an age from the point of view of something that operates that quickly. So we say: start with nature, take the human, optimize it on nature, freeze the human, then take the AI and optimize it on the human. That's the construction that allows us to do these power calculations with this two-agent setup, and it's also relevant to AI alignment because it's designed to be at least an approximation of what a scenario like this might look like.

And when you say you take nature and let the human optimize on nature, here I take it you mean you let the human be a reinforcement learning system that, through the standard reinforcement learning process, learns strategies that seem to work well in this environment — and you let that happen while the environment stays static. Then you say: okay, whatever strategy the human has learned from that experience, that's the strategy we're going to stick with; that's the human's strategy forever. And now we introduce the AI, which gets to optimize — it gets to actually interact with the human and the environment together and figure out its strategy for that combined system, learning all the while, in a way that out-thinks the human — or not out-thinks, but learns while the human is not learning anything, is static, just sitting there like the environment was for us. So the human becomes, in a way, kind of like the environment, or part of the environment.
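Here is a minimal sketch of that two-phase construction — not Ed's actual code, and deliberately stripped down to a tiny one-dimensional "maze" in which the agents' rewards don't interact (in the real experiments the agents' goals and dynamics can couple, e.g. contested cheese). All names, the maze size, and the example goals are assumptions for the illustration.

```python
# Phase 1: the "human" optimizes against a static environment.
# Phase 2: the human's policy is frozen, and the "AI" optimizes while
#          treating (frozen human + environment) as its world.
import numpy as np

n = 7            # cells 0..6; actions: -1 (left), 0 (stay), +1 (right)
gamma = 0.9
actions = (-1, 0, +1)
step = lambda pos, a: min(max(pos + a, 0), n - 1)

def optimal_policy(reward):
    """Value-iterate a single agent on the static line world; return greedy policy."""
    V = np.zeros(n)
    for _ in range(500):
        V = np.array([reward[s] + gamma * max(V[step(s, a)] for a in actions)
                      for s in range(n)])
    return {s: max(actions, key=lambda a: V[step(s, a)]) for s in range(n)}

# Phase 1: human learns on nature alone (example goal: reach cell 5).
human_reward = np.zeros(n); human_reward[5] = 1.0
human_policy = optimal_policy(human_reward)          # now frozen

# Phase 2: AI optimizes over the joint state (human_pos, ai_pos), with the
# frozen human folded into the transition dynamics.
def ai_values(ai_reward):
    V = np.zeros((n, n))
    for _ in range(500):
        newV = np.empty_like(V)
        for h in range(n):
            h2 = step(h, human_policy[h])             # human acts, but no longer learns
            for s in range(n):
                newV[h, s] = ai_reward[s] + gamma * max(V[h2, step(s, a)] for a in actions)
        V = newV
    return V

ai_reward = np.zeros(n); ai_reward[1] = 1.0           # one sample from the AI's goal distribution
print(ai_values(ai_reward)[3, 3])                     # value with both agents starting at cell 3
```

Averaging values like these over many sampled goals for each agent is what turns this construction back into a power calculation for the human and the AI separately.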

### [38:00](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=2280s) The human as the environment

That's right, exactly. And so you're asking questions in that scenario like: how does it feel, from the human's point of view, to be operating in an environment that you are no longer optimized for? The interesting thing about this is that by introducing this AI, you now have two agents, and you can start to ask really interesting questions about alignment. Listeners of this podcast probably have some sense of what it means when we say "aligned AI" — in other words, give this AI goals that are similar to mine, to make sure it's truly doing what I want. But you can start to do interesting things with that, and the reason is that, if you remember, we talked about how power gets calculated: you don't know what your goal is, so you have some distribution of goals. Similarly for the AI: we assume the AI doesn't know what its goal is, so we try a whole bunch of different goals. What's interesting is that you can actually test out different statistical relationships between the human's goal and the AI's goal. Relationship number one is: the human doesn't know what it wants, and neither does the AI, but the AI and the human always want exactly the same thing.

Just to make this concrete, I'm thinking back to your maze-and-cheese analogy. Negatively correlated rewards might be: there's one piece of cheese, and if the human gets it the AI won't get it, and if the AI gets it the human doesn't get it. And a positively correlated reward is something like: there's a piece of cheese, and whether the human gets it or the AI gets it, both benefit — both get, I don't know, plus one cheese reward, and they're both happy. Exactly, that's right. There's a reward in the maze — the piece of cheese — and if the rewards for the human and the AI are the same, it doesn't matter which of the two gets the cheese; they're both happy if either one gets it. It's like: my AI loves me so much that if I eat the cheese, it's just as happy as if it had gotten to eat the cheese itself. And you can do this with any level of correlation. You can think of it like: if I eat the cheese, maybe my AI is only 80 percent as happy as if it had eaten it — it would rather eat the cheese itself, but if I get to eat the cheese, it's still pretty happy. And of course you can do this statistically — it's all in code, so you can do whatever you want. You can have something like: if I eat the cheese, maybe there's an 80 percent chance the AI is just as happy as me, and a 20 percent chance it doesn't care one way or the other that I ate the cheese — or maybe there's some chance it hates the fact that I ate the cheese, that it really doesn't want me to be happy. In practice, in terms of what we actually investigated, I was looking at the spectrum between "the AI does not care at all what the human wants" all the way up to "the AI and the human both want exactly the same thing." The reason I focused on this part of the spectrum in particular is that we can at least hope that human beings are good enough at getting AIs to do what they want that we are not actively making our AIs do things we don't want. That's the domain of a scarier category of risk, where the AIs are actively moving against us — but we care about neutral-to-good right now.

That's really interesting, because naively I would have expected that if you have an AI system that's maybe not a hundred percent thrilled when the human gets the cheese first — maybe it's just twenty percent thrilled — that should still be good enough, right? You'd think: well, they still kind of want the same thing, shouldn't this be fine? So are you saying that this by itself leads to potential conflict, or power-seeking behaviors from the AI that are undesirable from the human standpoint? Yeah.
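One simple way to realize the "spectrum of correlation" Ed describes is sketched below. This is my own illustration of the idea, not the paper's construction: with probability `alpha` the AI's reward on a cell is copied from the human's, otherwise it's drawn independently, so `alpha = 0` is "doesn't care what the human wants" and `alpha = 1` is "always wants exactly the same thing."

```python
# Sample human/AI goal pairs with a tunable degree of alignment.
import numpy as np

rng = np.random.default_rng(0)

def sample_goal_pair(n_states, alpha):
    """Return (human_reward, ai_reward); alpha in [0, 1] controls alignment."""
    human_reward = rng.uniform(0.0, 1.0, size=n_states)
    independent = rng.uniform(0.0, 1.0, size=n_states)
    mask = rng.random(n_states) < alpha      # which cells the AI "agrees" on
    ai_reward = np.where(mask, human_reward, independent)
    return human_reward, ai_reward

h, a = sample_goal_pair(13, alpha=0.8)       # e.g. mostly, but not fully, aligned
```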

### [42:30](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=2550s) What leads to power seeking

One of the most interesting early results of this has been that when your goals in the scenario are totally unrelated — the AI does not care one way or the other what the human wants, the human doesn't care what the AI wants, they're totally uncorrelated — you tend, pretty consistently, to get the human and the AI competing on power. What I mean by competing is that there's going to be a particular state in this maze that the human likes a lot — the human sees a lot of power there — whereas the AI dislikes it a lot and sees very little power there. And when I say systematically competing, I mean that the states the human thinks are good tend to be consistently states the AI thinks are bad, and vice versa. So just having your goals be neutral is enough to make you compete on the instrumental sub-goals: if your end goal is totally unrelated to my end goal, that's enough to get us to compete on these sub-goals. For example, suppose you and I have random, unrelated end goals. All else being equal, if there's nothing else mitigating this, we're going to end up competing with each other to make money, or competing for some pot of resources.

So let's say my end goal, the thing I want out of life, is to paint the White House blue, and the thing you want out of life is to make the largest stack of paper pages you possibly can. My first move and your first move might both be to get into engineering programs in college, with the plan of making a bunch of money that we can then use to finance our insane goals. So we'll compete over those slots in the top engineering colleges, then we'll compete over making the money we need to make those things happen, and so on. Even though those two end goals are completely different — and you'd think surely we could stay out of each other's way — depending on how the world is set up, you will find competition. And that last piece, "depending on how the world is set up," seems really important here. How would we expect this to scale up into the complex real world? Because with the example I just cited, that seems like a counter-argument to the idea that this should be a worry: if I want to paint the White House blue and you want to make a heaping stack of papers, I don't think in practice we would actually care about each other's goals — it seems like we could just ignore each other. Is that not the case?

Yeah, that's the intuition, but the fact that we end up competing is true in the average case. It may or may not be a strong effect — it's still unclear how strong that effect ends up being — but it definitely seems to be an effect, and it seems to be a pretty consistent one. It might be that in that specific example — you want to paint the White House blue, I want to make a giant stack of papers — we don't end up competing. But over the set of all the possible goals you might have, combined with all the goals I might have, the question is: on average, where do we end up? And on average, we end up in this competitive kind of position. Wow. And again, these are small-scale things, built around a particular scenario where you have this human and this AI — it's really only possible to do the experiment under that scenario, because for technical reasons it turns out to be very hard to do a fully general experiment. And even beyond that, there's a lot of distance between these kinds of experiments and the real world. But to the extent we're able to probe and investigate, we do see this very clear pattern: if these two don't care about each other's goals, they're going to end up competing, at least to some extent, in the average case, by default. Conversely, if they have exactly the same goals as one another, they will actually have identical power in every state — the power I have will be the same as the power the AI has at each state in the maze, and so forth — so they exactly agree on how good the states are, on average.

That's really interesting, because that's quite a powerful claim. To the extent that what you're saying is: look, we don't know whether AIs will intrinsically be competitive with humans, whether they'll intrinsically try to undermine, manipulate, or control human beings — but to the extent we've been able to probe this experimentally, there is now experimental evidence that in every statistical setup we've tried, the default outcome seems to be competition.
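One simple way to quantify the "competing on power" pattern — again my own illustration, not the paper's metric — is to estimate each agent's power in every state (for example, with the sketches earlier in this page) and then check whether the states the human rates highly are the ones the AI rates poorly:

```python
# Correlation of per-state power estimates as a rough competition score.
import numpy as np

def power_competition_score(human_power, ai_power):
    """Pearson correlation across states: strongly negative suggests systematic
    competition on instrumental value; +1 means exact agreement."""
    return float(np.corrcoef(human_power, ai_power)[0, 1])

# Toy input standing in for per-state power estimates:
human_power = np.array([0.9, 0.7, 0.2, 0.1])
ai_power    = np.array([0.1, 0.3, 0.8, 0.9])
print(power_competition_score(human_power, ai_power))   # exactly -1 for this toy input
```

On this reading, "uncorrelated end goals lead to competition" corresponds to a negative score, and "identical goals give identical power in every state" corresponds to a score of +1.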

### [48:00](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=2880s) Competition as a default outcome

And one of the things that means — one of the bottom lines for this — is that, in order to get to neutral, the point where you and I can live and let live and are not competing with each other for money or college spots or whatever, requires a certain minimum degree of alignment between our goals. Maybe you don't have to be super-aligned, crazy-aligned, but we at least have to agree on, say, 20 percent of things on average — something like that. I really don't know what the actual number is, I'm just throwing it out there. But that seems to be one of the bottom lines: you and I need to have some kind of baseline agreement about how we want the world to look if we want to get from competing with each other by default to at least being able to stay out of each other's way. And that's in the context of this human-AI scenario. So the idea is: if we want a world where, at a bare minimum, this AI system is not doing things we don't like, we have to do more than zero effort to get a non-negative outcome for ourselves.

So essentially, if you have an AI system that's superintelligent, of the sort that — as we've seen on the podcast — a lot of people, including the very people who built the most impressive AI systems of our era, think could be developed in the next decades, certainly, and some of them have timelines quite a lot shorter than that — if those systems exist, then by default we should expect to be competed with, to have these systems compete with us. That seems like a recipe for some pretty significant levels of risk. Does that jibe with your assessment? Yeah — again, there's a lot of distance between these results and the real world, but to the extent we're able to do the experiment and observe the results, that certainly seems to be a reasonable conclusion.

One of the interesting questions down the road for this experimental setup is how the strength of this interaction changes as you scale the system: as you make the world bigger, as you add physical interactions between agents — which is one of the last experiments I did, actually — as you start to add complexity and do things in your world, what's the trend in those interactions? Do things seem to get better or worse as things scale? That's a really interesting question, and the answer is slightly complicated; it has to do with the time horizons of the different agents — maybe we can get into it later, it's a sort of extra little subtlety. But my intuition is that as these systems scale, and are able to optimize more and have a longer time horizon, you start to see the necessary threshold of alignment increase. I suspect it becomes more and more necessary to be more and more aligned with your system if you want the effect of that system on you, on average, to be neutral — so if you want this AI at least not to bug you or bother you, you have to agree with it more and more, on more and more stuff. I'm not 100 percent sure of that, but that's my intuition — a fairly weak intuition — based on where I'm seeing the trends.

It makes sense. And I guess the interesting thing about this line of research — and you sort of see this in the AI safety world — is that it has shifted the burden of proof, I think it's fair to say, onto people who claim that AI will be fine, that we can have superintelligent systems that, just by virtue of being created, are not going to pose a risk to humans by default. Every piece of evidence we have, every experiment we've done, every attempt to actually create simulations of what this might look like, seems to suggest that at the very least there's reason to be worried about this. To me, that's a complete shift from where this conversation was five years ago, when you had people throwing hypotheticals at one another and nobody really knew. It kind of seems like the early evidence, at least, is nudging us in the direction of being maybe a little bit more concerned, which is

### [52:45](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=3165s) General concern

unfortunate. I mean, do you have any thoughts about people who might be listening to this and thinking, hey, I'd love to maybe contribute to this line of research too, because it does seem very important? What would you recommend for them? Yeah. First off, I would agree with you — as far as I know, this is the first direct experimental evidence for the instrumental convergence thesis, and I think that's fairly significant. But in terms of contributions — one thing I should mention, and thanks for reminding me, is that I'm actually open-sourcing the whole code base I used to do these experiments, because my view is that this is now in a position where you can at least see minimal results coming out of it. Because of the way this is put together, there is an enormous amount of space to explore, in terms of: what if we built a system that looked like this — would that cause this instrumental convergence thing? I've only explored a very tiny fraction of all the different possible configurations for this stuff. The really interesting thing is this fundamental human-and-AI scenario, which we can play around and toy around with, and the code base itself that implements it, and the documentation around it, which enables a lot of experimentation to be done. I'm at the point where I'm not going to be able to do all these experiments myself — there are just so many different possibilities — so it makes more sense to enable people to take it upon themselves, take a weekend or a couple of weeks, play around, and see if they have any ideas or thoughts about how to test something out. The other thing is that there's also the potential for something like this to serve as a generator of intuitions around solutions to this problem. If we set up a scenario where, all of a sudden, this effect disappears, or is less strong, or we notice it's no longer happening as strongly as we thought — hey, maybe that's a clue. It doesn't mean that thing is necessarily going to work — it could just be a fluke, it could be anything — but at the very least it starts the gears turning around what kinds of structures or setups might be more robust, or less amenable to this kind of competition by default. So I think there's a rich space of areas to explore here that there's absolutely no way I'm going to explore myself, and that's what open source is for. Ultimately we want this problem to be solved as fast as possible, and this seems like the best way to do that.

Yeah. And for people listening, or people who maybe saw the Alex Turner podcast episode: this line of research is getting a lot of attention. Alex Turner's work was featured at NeurIPS, which is obviously the number one AI conference in the world. When you talk to people in the AI safety community, this kind of work is seen as very interesting, potentially at the nexus of a really big set of discoveries around AI safety, around the long-term safety of AI systems. So if you're looking for an area where you can make a really big dent — the sort of low-hanging-fruit thing to work on — I've checked out a lot of this stuff and found it fascinating, found it to be one of very few paths that seem quite promising. I'm biased, obviously — and we're brothers, and I've heard you talk about this a fair bit — but we've also spent the last two years on the podcast talking to folks from DeepMind, OpenAI, Google AI, and so on, and this really does seem like one of those very few areas where there's real promise for progress. So I just want to pitch it out there: if you're listening to this and you think, hey, I'd love to see if I can contribute to technical AI safety, maybe fork the GitHub repo, start working on it yourself, and see what you can do — because it's a really great way to maybe produce some pretty powerful and impactful research. Yeah — and again, just like with any research, there are obviously a lot of assumptions that underpin it and a lot of uncertainty around it, so it's just like any other research in this field: it's pretty tentative, it's an attempt at getting to an understanding. But I think that with the repo and the code in the shape they're in, at the very least, if I've done a good job of documenting it, it should be relatively low-hanging fruit even to play around with. You can probably, if you're

### [57:30](https://www.youtube.com/watch?v=dYSw-SV_fsI&t=3450s) Wrap-up

moderately good at Python, get something interesting going in a week or so. That's at least the hope — that's how I've tried to put it together — so fingers crossed on that. Awesome. Well, thanks so much — I think it's so great that you did that, so great that you're open-sourcing it so anybody can play around with it. And thank you for sharing your thoughts again on the podcast; a lot of progress since the last time you were on, so it's great to hear. Anyway, that's maybe going to wrap it up for this episode. If you want to follow Ed, you are on Twitter, on social media, at Neutron xerons — did I get that right? Edouard Harris... oh, I'm sorry — score Edouard Harris — maybe we'll just add a link in the show notes, yeah. Great. All right, well, thanks Ed, thanks everyone — that's a wrap. Thanks for having me on.

---
*Source: https://ekstraktznaniy.ru/video/45962*