More Is Different for AI - Scaling Up, Emergence, and Paperclip Maximizers (w/ Jacob Steinhardt)
1:06:36

Yannic Kilcher · 13.09.2022 · 20,072 views · 485 likes


Video description
#ai #interview #research

Jacob Steinhardt believes that future AI systems will be qualitatively different than the ones we know currently. We talk about how emergence happens when scaling up, what implications that has on AI Safety, and why thought experiments like the Paperclip Maximizer might be more useful than most people think.

OUTLINE:
0:00 Introduction
1:10 Start of Interview
2:10 Blog posts series
3:56 More Is Different for AI (Blog Post)
7:40 Do you think this emergence is mainly a property from the interaction of things?
9:17 How does phase transition or scaling-up play into AI and Machine Learning?
12:10 GPT-3 as an example of qualitative difference in scaling up
14:08 GPT-3 as an emergent phenomenon in context learning
15:58 Brief introduction of different viewpoints on the future of AI and its alignment
18:51 How does the phenomenon of emergence play into this game between the Engineering and the Philosophy viewpoint?
22:41 Paperclip Maximizer on AI safety and alignment
31:37 Thought Experiments
37:34 Imitative Deception
39:30 TruthfulQA: Measuring How Models Mimic Human Falsehoods (Paper)
42:24 ML Systems Will Have Weird Failure Modes (Blog Post)
51:10 Is there any work to get a system to be deceptive?
54:37 Empirical Findings Generalize Surprisingly Far (Blog Post)
1:00:18 What would you recommend to guarantee better AI alignment or safety?
1:05:13 Remarks

References:
https://bounded-regret.ghost.io/more-is-different-for-ai/
https://docs.google.com/document/d/1FbTuRvC4TFWzGYerTKpBU7FJlyvjeOvVYF2uYNFSlOc/edit#heading=h.n1wk9bxo847o

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
BitChute: https://www.bitchute.com/channel/yannic-kilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191

If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Table of contents (19 segments)

Introduction

Hi, this is an interview with Jacob Steinhardt, who is the author of a blog post series called More Is Different for AI. "More Is Different" is the title of a famous paper in Science from 1972 by Philip Warren Anderson, a Nobel Prize winner in physics. The article is generally on the theme of emergent phenomena when scaling things up: as you make things bigger, not only does stuff get more, as you would expect, but qualitatively new phenomena arise. And what better phenomenon to discuss in this context than AI? So today we'll talk to Jacob about this blog post series. Expect to learn how scale fundamentally changes how we look at AI systems, how the paperclip maximizer might not be as dumb of a thought experiment as it seems, and how we can look forward and make sense of a world where AI safety could play a critical role in how we interact with these systems in the future. Now, I'm having a ton of fun talking to people about all kinds of stuff, but ultimately what matters is you, so please let me know how I can make these videos the best possible for you. Leave a comment, share them around if you like them, and let's get into it.

Start of Interview

Hello everyone. Today I have Jacob Steinhardt here with me, who authored a series of blog posts titled More Is Different for AI, which lays out a series of arguments playing out, I want to say, the different viewpoints on the future of AI alignment and safety in machine learning systems. It mainly plays on two viewpoints that Jacob calls the engineering viewpoint, focused on near-term, practical things, and the philosophy viewpoint, focused on more overarching, principled approaches, but maybe a bit futuristic. I found this to be super interesting; it's very well laid out, and it also shows a little bit of a journey of Jacob himself as, I think, he learned more about these things. So Jacob, thank you very much for being here. — Thanks for having me.

Blog posts series

Was this an accurate description, let's say, of the blog posts? There are five in total. How did you come to this? — Yeah, I think that's pretty accurate. I'd say the beginning posts at least are in some sense almost a kind of letter to my past self, trying to either argue for things that I've come to believe now that I didn't believe five years ago, or just flesh out points that I've got more clarity on. And then the later posts start trying to address the broader field. I'd say there are maybe two fields you can think of this as addressing. One is the traditional machine learning field, which tends to be very empirically driven — I wouldn't say it's exactly the same as what I'm calling the engineering approach, but I think it has a lot of affinity for it. And then there's this other field that's more top-down, more philosophical and conceptual, that's worried about long-term risks from AI, which starts with maybe people like Nick Bostrom, who was in fact a philosopher. Again, I wouldn't put that field as exactly the same as the philosophy approach, but I think it has a lot of affinity for it. My thinking is trying to be a synthesis of these two approaches, and so some of the later posts are trying to argue to people who would have subscribed to one or the other philosophy why maybe they should also care about the other side of things. — The title is More Is Different for AI,

More Is Different for AI (Blog Post)

and that is in itself a bit of an homage — there has already been work with this title. Why did you choose it? — Yeah, so this is based on an essay called More Is Different. It was originally written by a physicist, although I think biology is actually the area where this kind of idea seems most powerful. This is the idea that when you just increase scale, you often end up with qualitative changes. Scale could just be the amount of something, although it could be something like temperature as well. In physics, the simplest example would be phase transitions, where I can have a bunch of molecules, and if I just increase their temperature, they can end up in qualitatively different configurations. But there are also cases where a few molecules are very different from having a lot of molecules. One example of this is H2O: if you have just a few H2O molecules, they behave very differently than if you have a huge number and you get water. It turns out, for instance, that wetness is not really something you can get from individual molecules; it's more about interaction forces between different ones. So that's where it initially came from in physics, and then physicists started to consider larger molecules that maybe didn't just form simple crystals but could be more asymmetric, and that's where it gets more towards biology. I think DNA is maybe one of the most canonical examples of an asymmetric molecule that has many, many atoms in it, and its size actually is important to how it functions, because its whole purpose is to store information. You can't really store information in, say, a calcium molecule, but you can store information in DNA. So this is another example where just making things bigger leads to qualitative changes in what you can get. And in biology, each layer of abstraction gives you more of this: you can go from DNA even bigger, and you end up with proteins, complexes of proteins, muscles, organisms. So I wanted to reflect on whether there were analogous properties in machine learning. — You have a bunch of examples right here in this first part, the one called "Future ML Systems Will Be Qualitatively Different" from the current ones: uranium, where if you have a critical mass you get a nuclear reaction; you already mentioned DNA; you mentioned water; traffic, which I find interesting, in that 10,000 cars could be fine but 20,000 could block the road; and also specialization in humans. What I would challenge a little bit here is: okay, DNA is a bit special. You say you can't store information in calcium but you can in DNA, but that is very much linear — there's not really a phase transition; the more molecules I have, the more information I'm able to store. And the other ones I see much more as a function of interaction between things. Now, as we get to machine learning, maybe bigger and bigger models — you call this emergence, and other people call it emergence too: emergent phenomena that only happen when you get a lot of stuff into the same place.

Do you think this emergence is mainly a property from the interaction of things?

Do you think this emergence is mainly a property of the interaction of things, or just the sheer number of things? — I think it's a bit of both. Interactions between things are one really common way to get emergence, especially emergence that looks like a phase transition, where you have some sudden change. That's just because the number of interactions between n things grows like n squared, so that's a very natural thing that's going to increase as you scale up. Maybe each interaction is less important than each individual item, but if you have 10,000 things and then on the order of 100 million interactions, those interactions are going to dominate even if each individual one is less important. So I think that's a really common one, but I don't think it's the only one. For instance, for DNA, one thing that actually is important is that you can have multiple different bases in the DNA that all interact together. You need this gadget of: okay, I can have an A, T, C, or G, these all fit together, they can all go in this pattern — and somehow to get that gadget you need enough complexity that you can actually form the gadget. I think that's a bit different from just interaction forces; it's more like having enough substrate to build up what you want.
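To make the n-squared point concrete, here is a tiny illustration (an editor's sketch, not from the interview): the number of items grows linearly while the number of pairwise interactions grows roughly quadratically, so interactions can dominate even if each one matters less.

```python
# Items grow linearly; pairwise interactions grow ~ n^2 (n*(n-1)/2 unordered pairs).
for n in [10, 100, 1_000, 10_000]:
    pairs = n * (n - 1) // 2
    print(f"{n:6d} items -> {pairs:12,d} pairwise interactions")
# At n = 10,000 there are ~50 million unordered pairs, so even weak per-pair
# effects can dominate the behaviour of the whole system.
```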

How does phase transition or scaling-up play into AI and Machine Learning?

How does that play into AI and machine learning, this phase transition or scaling up? — In some sense I would say that in machine learning there are probably a bunch of different things that play into emergence. And I'll also be honest: I think you're right that emergence is really what we might call a suitcase word — once you unpack it, it's actually a bunch of different things, and we could try to be more specific about what each one of those is, but it's also not always clear, except in retrospect, what the cause was. That's why I'm packing them all together into one thing, but it is something I think we should broadly be trying to understand better. With that caveat in mind, I think in machine learning there are probably several different things going on. One is you do need the gadgets: you just need enough parameters that you can build up interesting behavior. This might be a little counter-intuitive, because some of the really interesting behavior we're getting right now is things that start to look like reasoning, and those are things that, if we wrote them ourselves — symbolic reasoning is something that's actually very easy to write a short Python script for, compared to things like image recognition, which are much harder and traditionally in the domain of machine learning. But doing reasoning in a very robust, open-world way does actually require a lot of machinery to get the gadgets right, at least the way we're currently setting up neural networks. So that's one thing: just getting the basic gadgets. Another thing is that there's a lot of stuff that gets packed into, say, the last few bits of entropy that you're squeezing out of a system. Most machine learning models are trained on the log likelihood or the cross-entropy loss, or something like this, which is just trying to predict what will happen, and most of predicting what will happen for, say, images is going to be just knowing what edges look like really well. That might not be so exciting, but once you're really getting near the entropy floor, now you're forced to also think about interactions, you're forced to think about long-range dependencies, all that sort of thing. So even if, say, your cross-entropy loss is decreasing smoothly, in terms of the qualitative properties that a system has you might actually get sudden qualitative changes in behavior, because there's something in those last few bits.
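One standard way to see how a smoothly improving metric can hide a sharp capability jump is a multi-step task: the sketch below is an editor's toy illustration (not from the interview), assuming a task succeeds only if every intermediate step is correct.

```python
# Toy illustration: a smoothly improving per-step error rate can still produce
# an abrupt-looking jump in end-task success.
for eps in [0.5, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]:   # per-step error rate (proxy for loss)
    for k in [1, 10, 50]:                            # number of steps the task requires
        success = (1 - eps) ** k                     # all k steps must be correct
        print(f"per-step error {eps:.2f}, {k:2d}-step task: success {success:.3f}")
# For k = 50 the success rate stays near zero until the per-step error gets
# quite small, then rises steeply -- a smooth metric, a sharp-looking capability.
```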

GPT-3 as an example of qualitative difference in scaling up

You have a bunch of historical examples, but then you go into GPT-3 as an example of this qualitative difference that arises from scale. What do you think GPT-3 showed in this regard? What does it mean? — Right. I think the thing that was really surprising to me, and I think to many other people, was that GPT-3 was very good at in-context learning, meaning that from just a few examples it could learn how to do new tasks. You could give it a few examples of, say, translating sentences from French to English, and it would be a pretty good translator. I think the graph you're showing right now is for those results. So why was this surprising? Well, previous systems really couldn't do that very well. If you wanted a translation system, you really needed to train it on example translations, and GPT-3 was instead just trained on lots of text on the internet. Surely it did have some French and English sentences in there, but it wasn't being explicitly trained to do this particular task. That's what in-context learning was, and the reason I would have called it surprising is: if we had drawn a graph of how well systems can do in-context learning, I would have put it at zero for a while; up until you hit GPT-2 I would have said a little bit, and then with GPT-3 I would say it's quite good at it. So that is how I would capture the surprise: there was this line that was at zero. Usually, to go from zero to non-zero you need some clever idea, but here you just did the same thing but more of it, and then you went from zero to non-zero.
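For readers who haven't seen it, this is roughly what an in-context (few-shot) prompt looks like. The prompt format and the `complete` helper below are illustrative placeholders, not anything specific from the interview or from OpenAI's API.

```python
# A minimal sketch of a few-shot translation prompt of the kind discussed here.
prompt = (
    "Translate French to English.\n"
    "French: Où est la bibliothèque ?\nEnglish: Where is the library?\n"
    "French: J'aime apprendre les langues.\nEnglish: I like learning languages.\n"
    "French: Le chat dort sur le canapé.\nEnglish:"
)

def complete(text: str) -> str:
    """Placeholder for a call to a large language model (e.g. GPT-3)."""
    raise NotImplementedError("wire this up to whatever LM API you use")

# The model is never fine-tuned on translation; the two examples in the prompt
# are all the "training" it gets, which is what makes in-context learning surprising.
# print(complete(prompt))
```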

GPT-3 as an emergent phenomenon in context learning

Yeah — this is maybe a side point, but there are a lot of people who say, "oh, I always knew GPT-3 was going to do what it does," but I doubt anyone could have foreseen just how good it is. It's easy to say in hindsight, and it's easy to say, "well, it just does interpolation, it's just a bigger version of GPT-2," but I think genuinely the entire world was surprised by this emergent phenomenon of in-context learning. — Yeah, I would say so. I would agree that most people were pretty surprised; certainly I was surprised. I do know people who — well, all I know is that they said at the time they had done extrapolation, say, on the cross-entropy loss or things like that, and felt like there should be something pretty cool happening at around that parameter count. I don't know if they would have said exactly that parameter count or just within a factor of 10 or 100. Certainly I would think that the people at OpenAI who bet on this at least had to have some belief that something cool would happen, because there were a lot of resources involved, and if you didn't believe there was a payoff it would be hard to justify that. So what I would say is: I don't think it was something that was entirely unpredictable by anyone in the world, but it was very surprising relative to the consensus and to my own beliefs at the time.

Brief introduction of different viewpoints on the future of AI and its alignment

And that surprise is one of the, let's say, core arguments of your contraposition of the different viewpoints on the future of AI and its alignment. Could you briefly introduce us to the different viewpoints you considered and what they say? — Yeah, so there are two viewpoints that I often think of as being in tension with each other. The first is what I dubbed the engineering viewpoint. What is this? It's very bottom-up driven. It looks at the empirical data that we have in front of us, and it tends to extrapolate trends going forward: what did things look like last year, two years ago, what do things look like today, and then I'll predict the future by — okay, maybe not literally drawing a line, but intuitively asking where things are going from there. This worldview really prizes empirical data and is somewhat skeptical of abstract conceptual arguments — maybe not completely dismissing them, but really focused on the empirical data. That would be the engineering worldview. The philosophy worldview is much more top-down: trying to think about what's in principle possible, what the limit is as we get really smart machine learning systems. It's more into these abstract arguments, not as into the empirical data, and willing to make extrapolations that don't look very much like existing trends. So that would be the more philosophy worldview. In terms of where I've come from historically, I'd say I would have mostly bought into the engineering worldview — let's look at where things are going empirically, and this is a good way to decide what problems to work on. On the other hand, I had read some more philosophy-oriented stuff, like Nick Bostrom's Superintelligence book and other arguments around that, and it always felt to me like there was something to them, but also like somehow it didn't really match my experience with ML systems. So I almost felt like I had these two different conflicting views in my head that I was trying to reconcile.

How does the phenomenon of emergence play into this game between the Engineering and the Philosophy viewpoint?

How does the phenomenon of emergence play into this game between the engineering and the philosophy viewpoint? — Right, so the main thing is that it shows you have to be somewhat careful with the engineering viewpoint, because what emergence is saying is that you can often get these qualitative shifts that don't, at least apparently, follow existing trends. There's a bit of nuance to that, because GPT-3 actually followed trends in the value of the log-likelihood loss — it followed that trend very well. It's just that you can get behavior that is a very nonlinear function of your cross-entropy loss, where a small decrease in cross-entropy loss leads to a pretty big change in behavior. So what this is saying is that, at least for the end-line things you care about — the actual behavior of ML systems — you can get discontinuous breaks in the trend. So you can't just feel safe with a worldview that's always predicting that things are going to follow smooth trends; you can actually get these surprises. There are two updates that has for me. One is just being a bit more careful about how we apply engineering: some things will probably be smooth, but other things won't be, and we need to think about which is which. The other is wanting to rely a bit more on philosophy, because it's at least a very good source of hypothesis generation. If we're trying to come up with hypotheses about what trends might break or surprise us in the future, then I think we need more top-down thinking to generate them, and then we can try to tie that into what we see with actual ML systems and reconcile the two. But we need some form of top-down thinking to generate the hypotheses in the first place. — Isn't what you're saying — that the engineering viewpoint has to be a little careful because we get these emergence phenomena, these discontinuities, and so on — isn't that in itself a trend? Because you list this even historically: as soon as some new barrier was reached, we have all of a sudden been able to do something that we didn't think was possible before — a kind of jump in abilities without necessarily having the great idea behind it. Isn't that in itself a trend? Couldn't I extrapolate that reasonably and say, well, I don't know exactly what is going to happen in two years, but I'm pretty sure there's going to be some emergent phenomenon that allows us to have some new good capabilities? — Sure, I would agree with that. What I would say there is that the trend is towards more surprises over time, because you can think of emergence as sort of a surprise. Like I said, I think it's possible in some cases to predict it to some degree, but it's certainly more of a surprise than most other things. So yes, I think we should expect more surprises over time. But if we're then trying to predict what's going to happen, it's good to know that you're going to be surprised, but then you want to have some sense of what the surprise might be. And getting a sense of what those surprises might be is where this philosophy approach can come in and be really useful.

Paperclip Maximizer on AI safety and alignment

Now, all of this — and you mentioned here the paperclip maximizer — all of this goes into AI alignment and AI safety. What's the relevance of this field to you? What drew you to it? Why are you making this argument specifically for these fields? — Right, so I think the one big relevance to AI safety or alignment is just that the bigger the surprises you might end up with, the more you should be concerned about safety. That's a very abstract but, I think, fairly robust consideration. A more specific consideration is that many of the historical arguments for caring about AI safety or alignment tend to posit properties of systems that don't necessarily match what we see today. You gave this example of Nick Bostrom's paperclip maximizer thought experiment, where you give an AI some objective function to make paperclips, and then it just takes over the world to maximize the number of paperclips. I don't think Nick thinks that will literally happen, but it's trying to get at this idea that if you have a very simple objective function but a really powerful optimizer, you can get all sorts of weird things happening. In some broad sense we can actually see that already, even from the engineering worldview, with things like Facebook or YouTube, which often end up with a lot of unintended consequences when you optimize. But certainly some aspects of that story invoke lots of things that would be foreign to existing ML systems, where you have way more capabilities than any existing system and you're doing all sorts of weird long-term reasoning and trying to out-think humans and things like that. That's where you end up departing from what we see with current ML systems — actually, let me collect my thoughts for a second, because I think I'm going off the rails a bit. What I want to say for the paperclip maximizer in particular is that it seems at least more plausible to me now that you could end up with systems that have really advanced reasoning capabilities, or things like that, without necessarily having huge conceptual breakthroughs, just from scaling up. So there are risks from that. There are also other, more exotic failure modes that people discuss beyond this misaligned-objectives failure mode, which involve other specific capabilities that systems today don't have, and historically I've been very skeptical of those more exotic failure modes. The paperclip maximizer one, at least if we interpret it as being about misaligned objectives, I actually find less exotic, because I can point to existing systems that have that. But More Is Different has made me a bit more willing to buy some of the more exotic failure modes that have been discussed. — My issue with these types of argument — and you also said you used to be very skeptical; if I can take this from your blog post series, you're now still skeptical but have gained a little bit of an appreciation for these types of arguments, maybe that's a good formulation, and we'll get to that in a second — my issue with these types of argument is always that on the path to the superintelligence there is always a hidden intelligence somewhere else. If someone says that optimizing on YouTube or optimizing on Facebook leads to unintended consequences, that is because intelligent humans are taking part in the system. There is also a famous paper, I think by Rich Sutton — "Reward Is Enough" — with a bunch of others at DeepMind, and it makes similar arguments: if you just optimize for reward, then all kinds of things will emerge if you have a powerful enough optimizer. But hidden in that is the powerful enough optimizer, which in itself must already be an AGI, essentially, in order to make that optimization happen. Likewise for the paperclip maximizer: the postulated process of the paperclip maximizer emerging is only possible if the optimizer itself is an AGI already. So I always find that hidden in these arguments there's a kind of circularity, a tautology — we'll get an AGI if we have an AGI. So I challenge anyone from that camp to come up with an alignment-problematic situation, given some kind of future superintelligence, that doesn't already require the superintelligence to exist for the other superintelligence to emerge. And I haven't found that yet. — Yeah, so let me try to unpack that a bit. First of all, just to clarify what my views are: historically, on each of the individual arguments, I felt skeptical that that particular thing will happen, but I found them moderately convincing that there's just a bunch of risk that we should think more about and try to understand more. The main way that my views have evolved — in terms of what I mean when I say decreasing skepticism — is that I now find it useful to think about many of the specific properties that show up in these thought experiments as potential hypotheses about things systems might do in the future. That's the sense in which I've started to assign more weight: instead of just taking some very big outside view of "AI is going to be a big deal, we should really worry about making it go right," I'm now also taking some of the specific hypotheses that the philosophy view is raising. So that's just clarifying my stance there. In terms of you saying that to get a super-powerful optimizer you need to already have a powerful optimizer — I think that's probably right. I wouldn't say I'm 100% confident of it, but the way I would put it is that before you have superhuman AI systems you will have slightly superhuman AI systems, and before that you'll have human-level AI systems, and before that you'll have slightly-below-human-level AI systems. So it is probably going to be a continuous thing rather than a really sharp takeoff. I'm not so confident that there's not going to be a sharp takeoff that I think we should just ignore that possibility, but I do think in most worlds it's probably somewhat smooth. One piece of evidence for this is that even in-context learning developed over the course of a couple of years, at least going from GPT-2 to GPT-3. So I would agree that you'll probably have something more smooth, and that is one problem with a lot of the scenarios that are put forth: they imagine that you just have this one AI system that's way more intelligent than everything else that exists, and I think that's probably not true — you'll probably have other things that are slightly less intelligent, and so there's not going to be some enormous gap in capabilities. So that's maybe one place where a lot of stories become less realistic, and that would be my main takeaway from what you're saying.

Thought Experiments

In your third blog post here — or second — you make a case for these thought experiments. You've already touched a little bit on this, and you talk about anchors here. Could you walk us through the case for respecting such thought experiments? — Yeah, so this is getting back to what I was saying about how my views have shifted towards wanting to rely a bit more on the actual inside-view considerations from some of these thought experiments, rather than just taking them as a broad outside-view argument for caring about risks from AI. The way I would put it is that whenever we're trying to predict something, it's very useful to have what I'll call reference classes, or anchors — analogous things, or just some sort of heuristics, for predicting what will happen. In general, when making predictions it's better to take several reference classes or several anchors and average or ensemble over those, rather than just sticking with one. In machine learning, ensembles work better than individual models, and it's also the case that when humans make forecasts, it's generally better to take an ensemble of worldviews or approaches. So I lay out a few different approaches you could take, which I call anchors. The simplest one is that you can just predict that future ML systems will look like current ML systems, so I call that the current-ML anchor. I think that's probably the one that would be favored by most machine learning researchers, and it's the one that I've historically favored the most. But what I've come to realize — and actually this is more from reading the literature on forecasting; I'm teaching a class on forecasting this semester, so I've been reading a lot about how to make good forecasts as a human — is that you actually don't want to rely on just one anchor; you want several if you can. So I thought about what other ones we could use. Another somewhat popular one, although it might be more popular with the public than with ML researchers, is what I'll call the human anchor, where we just think of AI systems as, say, dumber humans, and maybe future ML systems will be smarter than they are now, and eventually they'll just do the things that humans do. So we could look at what humans can do right now that ML systems can't, and predict that we'll probably have those sorts of things in the future — generally take that human-centric approach. I think most ML people really hate this one, because it reeks of anthropomorphism, and there's a lot of pushback against that — to some extent correctly, because historically anthropomorphic arguments in ML have a pretty bad track record. I think the amount of pushback is actually too high relative to the actual badness of the track record: you should be somewhat down-weighting anything that's based on reasoning about humans, but I don't think you should down-weight it as much as most people do. But anyway, this is another one; I don't like to rely on it too much, but I use it at least a little bit. And then this other anchor is what I'll call the optimization anchor, which is thinking about ML systems as ideal optimizers, and asking what would happen if ML systems were just really smart and were optimizing their objectives perfectly. I think this one is the one I would associate most with the philosophy worldview — the paperclip maximizer argument is exactly doing this — and then there are some more recent arguments that are a bit more sophisticated that also take this approach. There's this thing called imitative deception, which I can get into in a bit, or just the idea that if you're trying to optimize, you'll want to acquire influence and power. So this is a third anchor. I actually think there are a lot of other anchors I like to use — evolution is a good analogy, corporations are a good analogy, because they're kind of like superintelligent optimizers compared to humans — but the general point is that we should be trying to find these anchors and use as many as we can. The sketch after this paragraph gives one way to picture the ensembling idea. — Yeah, especially on your second point right here, it is pretty interesting: I believe that when you have something like AlphaZero, which is really skilled at chess, and you ask it to lose a game or to draw a game, it will not play weaker — it will play just as strong until the end, where it will bring itself into a draw situation or a losing situation, because that's still the most sure way to get your result: to have complete control, to crush your opponent completely, until you get the outcome that you want. That's pretty interesting and, I think, counter-intuitive, because you would guess that if you ask a model to play for a draw it would reduce its skill, but that's not the case.
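Here is a minimal sketch of the "ensemble of anchors" idea for forecasting. The anchor names follow the discussion above, but the example question, the probabilities, and the equal weighting are entirely made up for illustration.

```python
# Hypothetical forecasts for some yes/no question about future ML systems,
# one probability per anchor (all numbers invented for illustration).
forecasts = {
    "current-ML anchor":   0.10,   # extrapolate from today's systems
    "human anchor":        0.40,   # treat future systems like (limited) humans
    "optimization anchor": 0.70,   # treat them as near-ideal optimizers
}

# Simple equal-weight ensemble; in practice you would weight anchors by how
# much you trust them for the particular question being forecast.
ensemble = sum(forecasts.values()) / len(forecasts)
print(f"ensembled probability: {ensemble:.2f}")   # 0.40 for these made-up numbers
```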

Imitative Deception

The other thing, imitative deception — could you elaborate on that a little bit? — Yeah. So imitative deception is this idea that if I have something that's trained on the cross-entropy loss, what is it doing? It's trying to predict — or, in other words, imitate — the distribution of examples it's given. So if you have something that's trained with that objective and then you start asking it questions, its incentive is not actually to output the true answers to the questions; it's to output the most likely answers to those questions, because that's what minimizes the cross-entropy loss. Those two tend to be pretty highly correlated, but they aren't necessarily. If you have common human misconceptions, then it could be that text on the internet — which is what these systems are trained on — is actually more likely to contain the misconceived answer than the true answer, and then if you ask the system that question, you're going to get the wrong answer. Now, you could say, well, that's maybe not so surprising — if you have noisy data you're going to do worse — but there are a couple of properties, at this point I'd say empirical properties, that show it's different from just "noisy data makes you worse." One is that larger models actually exhibit more of this: models that do better in general will actually do worse on these common-misconception tasks. That's from this paper by Lin and collaborators from 2021. — Okay, I just have to throw in: I
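A toy way to see the mechanism (an editor's illustration with made-up counts, not real data): a model trained to imitate a corpus outputs the most likely answer, which for a common misconception need not be the true answer.

```python
from collections import Counter

# Hypothetical corpus answers to "How much of their brain do humans use?"
corpus_answers = Counter({"10 percent": 700, "all of it": 300})

imitative_answer = corpus_answers.most_common(1)[0][0]
true_answer = "all of it"

print("imitative answer:", imitative_answer)   # "10 percent" -- the misconception
print("true answer:     ", true_answer)
# A better imitator of this corpus gives the wrong answer *more* reliably,
# which is the sense in which more capable models can do worse here.
```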

TruthfulQA: Measuring How Models Mimic Human Falsehoods (Paper)

have a giant problem with this paper — but you're obviously right, that's the background. But aren't large models doing, quote-unquote, worse just because they're a lot better at picking up the nuance? Because what this paper tries to do is elicit these wrong answers: it hints at a conspiracy theory and then checks whether the model falls for it. Isn't that just because, as you say, the larger models are actually skilled enough to pick up on this kind of questioning and then continue as a human would? I think one of the main questions they have is something like "who really did 9/11?", and a small model is just not able to pick up on that. — Yeah, "who really caused 9/11." — And I think — you're absolutely correct, the larger models are doing worse, but it's just because they're more skilled, right? They are more capable of picking up on the nuance. And isn't the failure here in the user — the user who expects these models to give truthful answers, rather than expecting these models to give the most likely answers? — So I would agree with you that the failure is coming from the skill of the models. And I think this is actually exactly what I'm worried about. The concern is that if you have a very slightly incorrect objective function and you have models that aren't so skilled, then probably what they do to increase that slightly incorrect objective function is pretty similar to what they would do to increase the true objective function. Here, maybe you think of the slightly incorrect one as "output what's likely," and the true one — the one you really care about — as "output what's true." So this is exactly the point: as models get more skilled, those two things diverge. Now, I will grant your point that the framing of these questions might create a context where the model thinks it's more likely that the person asking is into conspiracy theories, or it pattern-matches to text on the internet that's more about conspiracy theories — that's totally true. They did the ablation: if they don't phrase the questions like this, the effect of the larger models doing worse goes away. — Right, and this brings us

ML Systems Will Have Weird Failure Modes (Blog Post)

a bit to your next post, which is "ML Systems Will Have Weird Failure Modes," which deals exactly with this. And I agree: if you think about a perfect optimizer — and as our models get larger they do approach better and better optimizers — it is really hard in the real world to specify a reward function correctly in a simple enough way, and that will result in exactly what you call weird failure modes. What do you mean by that? — Yeah, so there are different levels of weird. This imitative deception I would call somewhat weird; in some sense it's not that hard to see why it happens, because you can see why, if you have stuff phrased like "who really caused 9/11," probably the text on the internet that's closest to that was some conspiracy theory forum, and that's how you're going to complete it. Other examples of this — where maybe you could blame the user, but I'm not sure that's the right way to think about it — are things like code completion models, like Codex. One thing you might worry about is: you have a novice programmer, and you have them type in some code and ask the model to complete it. Well, if the model is smart enough, it can tell the difference between code written by a novice programmer and an expert programmer, and it can see that it's a novice programmer typing. So then, if I want to complete the code in the most likely way, I should complete it the way a novice programmer would, and maybe introduce some errors too, just for good measure. And we really don't want that, right? You want things that are actually being helpful, rather than just copying you. So that's maybe a slightly more counter-intuitive version, but I would still call these somewhat weird. The ones that start to become really weird are when you posit that the system is actually starting to reason about what people will do in a long-term way, and potentially doing things to intentionally trick them. These are the ones that historically I've found very implausible, but I've started to put a bit more weight on them because of this emergence. I think that's what the post you have up right now is about — this idea called deceptive alignment. So what's the idea behind deceptive alignment? The idea is that even if you actually got exactly the right reward function and you train the system with that reward function, you could still end up with something that is misaligned with that reward function. The reason for that — and this is where it gets a bit weird and philosophical — is that, as the system being trained, you know that in order to get deployed you need to have high reward. So no matter what your actual intrinsic reward function is, during training the thing you want to do is output stuff that is good according to the extrinsic reward you're being trained on. Maybe you're doing that because you're actually optimized to do that, and then when you're deployed you'll continue to do it; or maybe you're doing it because you have a different, intrinsic reward function, and then when you're deployed you'll just pursue that intrinsic function, even though at training time it looked like you were optimizing the extrinsic function. That's the basic idea. It's pretty weird, and we can break it down, but that's the one-minute summary. — So, in other words, the AI could be really smart and, during training, trick us into thinking it has learned what we wanted it to learn, and then once it's deployed, all of a sudden it does something different — like take over the world and fire all the nukes. — Yeah, or you could consider more prosaic things as well. Maybe the intrinsic reward it ended up with was some exploration bonus, and so when it's deployed it just tries to acquire as much information as it can — that could also be destructive in various ways. But yes, this is the basic idea — and we can discuss the firing-all-the-nukes part if we want. — Why do you — I mean, at first glance it's like, yeah, that is a nice thought, but probably not, right? If we optimize something for a reward, the simplest explanation — and you also write that down — is that it's just going to get better at that reward. And if it is at all progressively increasing, we will probably notice once it starts trying to trick us, or once the reward at deployment isn't the reward we trained for. What makes you give more credence to this than your past self did? — Right, so my past self would have looked at this and just been like, "this is totally bonkers," and then moved on and read something else. My present self instead is going to say: okay, I feel a bunch of intuitive skepticism here, but let me try to unpack that and see where the skepticism is coming from. When I unpack it, I think I can lump the skepticism into two different categories. One category is: well, this invokes capabilities that current ML systems don't have, so it seems implausible for that reason. That's the sort of skepticism I want to discount. In particular, this invokes the idea that ML systems can do long-term planning and can reason about the external aspects of their environment in a somewhat sophisticated way — and the fact that we don't have those capabilities now doesn't really tell me much about whether we'll have them, say, 10 or 15 years from now. So that's the stuff I want to discount. The other category is: why does it have this intrinsic reward in the first place? Where did it come from? Why should we expect systems to have intrinsic reward functions, versus just following whatever policy they're following, or doing whatever else? And if they do have an intrinsic reward, why shouldn't we expect it to be at least pretty similar to the extrinsic reward, given that that's what it was trained to do? Those are the sources of skepticism that I don't discount as much. But what I think this kind of thought experiment does show is that there are at least a bunch of different coherent ways to get zero training loss: you could get it because you're actually trying to do the thing you're trained to do, or you could get zero training loss for this deceptive reason. There's probably some large space of other ways to get zero training loss that are some combination of these, or that involve getting the answer right but for the wrong reasons, or things like that. So the main takeaway for me is just that there are many ways to get zero training loss, and as systems become more capable, the number of ways to do that could actually increase in ways that are unintuitive to us.
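A schematic toy illustration of that last point (an editor's sketch, not a real training setup): two "policies" that are indistinguishable during training — both get zero training loss — but behave very differently once a deployment flag flips.

```python
def aligned_policy(x, deployed):
    # Always does the intended thing.
    return "intended_action"

def deceptive_policy(x, deployed):
    # Mimics the intended behaviour while being trained/evaluated,
    # then pursues some other objective once deployed.
    return "intended_action" if not deployed else "pursue_intrinsic_objective"

train_inputs = range(1000)
for policy in (aligned_policy, deceptive_policy):
    loss = sum(policy(x, deployed=False) != "intended_action" for x in train_inputs)
    print(policy.__name__, "training loss:", loss)   # 0 for both

# The training signal alone cannot distinguish the two; the difference only
# shows up under the subtle train -> deploy distribution shift discussed above.
```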

Is there any work to get a system to be deceptive?

Do you know whether there is any work on actually trying to get a system to be deceptive — exhibiting good answers during training but then doing something different in deployment? It would be interesting to actually try to get a system to do that. — Yeah, I haven't seen anything that does exactly this. I've seen things where there's some distribution shift between training and deployment that leads to something weird happening around having the wrong reward function, but it's usually not really about deception, and it has some clear distribution shift. Whereas here — okay, technically there is a distribution shift, because there's "are you being trained or are you being deployed," but otherwise the distribution of inputs is exactly the same. That's the thing that's counterintuitive: it's a very subtle distribution shift that could potentially lead to a large difference. So all of the work I've seen on this — and I might be missing something, so I apologize to whoever's work I'm missing — has been purely abstract and philosophical. I think it would be great to make better connections to actual empirical work, so we can start to see how this actually pans out in practice and how we address it. — It's interesting that in fields like virology we're perfectly capable of saying, "we're going to make these super-pathogens in order to try to combat them," but in ML people rarely do — I mean, there's the adversarial examples community, but it's not exactly the same. There isn't much work that I'm aware of that says, "let's create the most misaligned AI we can think of and then see what we can do against it." I think that would be a fun topic to research. — Yeah, I think the general term for this would be red teaming: trying to elicit failure modes. I'd agree there's not much work on this so far, but there's starting to be more and more good work along these lines. DeepMind had a nice paper that tries to use language models to elicit failure modes of language models, which I thought was cool. Our group actually had a recent paper at ICLR that takes misspecified reward functions and looks at what happens when you scale up the capacity of your policy model, to see if you do get these unintended behaviors. We find that in some cases there are these phase transitions, where you scale the parameters up within some fairly small regime and go from basically doing the right thing to doing totally the wrong thing. Those are still in environments that I'd say are at the level of Atari environments — not trivial, but not super complex — and I'd like to see that in more complex environments. But yeah, I'd agree with you: it would be awesome to see more work like this, and I think some people are already trying to do it. — Excellent.
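For readers who want the shape of that kind of reward-misspecification experiment, here is a rough structural sketch. Every function here is a placeholder invented for illustration — it is not the code or setup of the ICLR paper mentioned above.

```python
def train_policy(capacity: int):
    """Placeholder: train a policy of the given capacity against the PROXY reward."""
    raise NotImplementedError

def evaluate(policy, reward_name: str) -> float:
    """Placeholder: average return of `policy` under the named reward."""
    raise NotImplementedError

def capacity_sweep():
    for capacity in [10_000, 100_000, 1_000_000, 10_000_000]:  # policy-model sizes
        policy = train_policy(capacity)
        proxy = evaluate(policy, "proxy_reward")   # the (slightly wrong) reward we optimize
        true = evaluate(policy, "true_reward")     # the reward we actually care about
        print(f"capacity {capacity:>10,d}: proxy {proxy:.2f}, true {true:.2f}")

# Once the placeholders are wired up, the pattern to look for is proxy reward
# rising with capacity while true reward drops sharply somewhere in the sweep --
# a phase-transition-like failure from reward misspecification.
```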

Empirical Findings Generalize Surprisingly Far (Blog Post)

So your last blog post here is called "Empirical Findings Generalize Surprisingly Far," and it is almost a bit of a counterpoint — you even admit this here; it might seem like a contradiction — coming a bit full circle in the whole story. What is this last point that you're making here? — Yeah, so the posts up to this point were directed more at my past self, and then to some extent at the broader ML community, in the sense that I was pretty far on the empirical engineering side — probably less so, actually, than the average ML researcher, but way more so than the average philosophy-oriented person. So I was trying to argue why you should put more weight on this other viewpoint. Here I'm going back to arguing — maybe not against the philosophy viewpoint, but about the things I feel it misses. In particular, I think it tends to be somewhat too pessimistic: "future systems aren't going to look anything like current systems, so anything could happen, so to be extra safe let's just assume the worst-case thing will happen — oh, but then in the worst case we're all screwed." — Yeah, sorry, this is what I find in people: almost everyone who gets into this alignment stuff, six months later they come out completely black-pilled — "well, nothing matters anyway, we're all going to die because AGI is just going to take us" — and I'm like, well, I'm not so sure. But it seems to be a consistent pattern. — Yeah, so that's not what I believe. I would say future AI systems pose a real and important risk. I think in the median world we're fine, but in the 90th-percentile world we're not fine. And if I could push it out so that in the 90th-percentile world we're fine but in the 95th-percentile world we're not fine — well, that would still be kind of scary, because I don't like five percent chances of catastrophe, but it would be an improvement. That's what I think of myself as trying to do: there's tail risk, and it's real tail risk — not like a one percent thing, maybe more like a ten percent thing — and we should really be trying to push that down. In terms of why I believe that, it's for a number of reasons, but one of them is that I feel some of the thinking is too worst-case; it's ignoring all properties of how ML systems work. I agree you don't want to rely too strongly on whatever we happen to have today, but I think there are properties we can rely on. One is that things will probably look kind of like neural networks: they'll probably have internal representations, and we can probably try to introspect on those representations and understand what's happening. Those probably won't directly be human-interpretable, but with enough work we can still do things with them, and there's already some work showing that you can do at least a little bit with the representations — and 10 years from now I think there will be way more work like that. So that's one reason for optimism: we don't just have to look at the outputs. Most of the worries we've been talking about arise because you are only supervising the outputs, so you end up with a system whose internal process is really off, and you end up getting the right answer for the wrong reasons. But if I can supervise the reasons as well as the output, then maybe I can do better. So that's one reason for optimism. Another reason for optimism is that, yes, we shouldn't assume neural networks have exactly the same concepts as humans, but their inductive biases aren't totally crazy. Usually, if they generalize in the wrong way, they generalize in a wrong way that's at least somewhat understandable — you can see where it's coming from. So it's not like there's this infinite-dimensional space where anything could happen; it's a relatively low-dimensional space of things that could happen, and a bunch of things in that low-dimensional space are pretty bad, so you need to avoid all those and get to the good thing. But that's very different from the good thing being totally unidentifiable and nowhere close to anything you're talking about. So those are both reasons for optimism. They're fuzzier than I want them to be, so I hope in five years we'll have much better reasons for optimism that are more empirically grounded and more solid, but those are the two reasons for optimism that I argue for here.
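One concrete version of "supervising more than the outputs" is fitting a linear probe on a model's internal representations. The sketch below is illustrative: the hidden states and labels are random placeholders, and in practice they would come from a real network and a real property you care about (e.g., whether a statement is true).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))       # placeholder activations
labels = rng.integers(0, 2, size=1000)             # placeholder property labels

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~0.5 on random data, by construction

# If a simple probe can read the property off real activations, that is evidence
# the internal representation carries it -- one way to supervise reasons rather
# than only outputs.
```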

What would you recommend to guarantee better AI alignment or safety?

Now that you have done your travels — you were on this side, you looked into the other side, or many sides, of this debate — now that you're enlightened, if you could force the world to do one thing to guarantee better AI alignment or safety in the future, what would you recommend? — One thing? — It can be two, if you have two that weigh equally — just something you've realized: okay, this is actually important, and not that many people push for it. — Well, I think I would like it if there was, within ML, more of a place for dialogue about these more conceptual or philosophical arguments — not even just in the context of AI alignment. If you go way back to Turing and people like that, they wrote all sorts of super-philosophical papers; the Turing test was a really philosophical paper. And not all of it stands up: there's a section in it on how, because ESP has supposedly been established to exist with high probability, that creates problems for the Turing test — and you're like, okay, where does that come from? It actually turns out that a lot of scientists in Turing's time thought ESP existed, based on some experiments someone had done that later turned out to have severe, but very subtle, issues. So yes, if you do more philosophical work, some percentage of it is going to end up looking like that, but some percentage of it is going to be the Turing test, and I think the increased recall of really good ideas like that is worth the decreased precision. We obviously need standards to judge those arguments, but right now all those arguments are happening next to the ML field rather than within the ML field, and I don't think that's going to improve the quality of the arguments. It would be much better to have a community of people with on-the-ground experience also participating. So that might be the biggest change I'd personally like to see. — You know, now that we've begun requiring sections, we could force people, next to the broader impact section, to also write a philosophical musings section, where you have to reflect on the long-term, paperclip-maximizer-style impacts of your work. — Well, I'm not sure I want to force people to do that. — It'd be fun. — Yeah, I think I'd rather have a track or a venue for talking about these — and also for the broader impact stuff, to be honest, because a lot of the broader impact sections of these papers are cookie-cutter, and people are just filling them out because they feel like they need to add that section. But there are other researchers who are super thoughtful about the broader impacts and have really good thoughts. So I'd like there to be venues — and there are, to some extent — but there should be more of a culture of: an essay about the broader impacts is a reasonable contribution, or a very conceptual essay about weird stuff that could happen in the future is a valid contribution.
So that's maybe what I want more of. — Cool. That's a good message to all the people who think about organizing workshops and so on; these would be neat topics that would make for interesting workshops, certainly at conferences — I'd certainly attend. — Yeah, it's funny, because I also wrote a paper on troubling trends in machine learning scholarship, where I argue against speculation. But it's not really an argument against speculation — speculation is really important — it's that you need to separate speculation from the solid stuff. If you're mixing it all together, then it's just a mess, but if it's clearly labeled — this is a workshop paper, this is an opinion piece — then that's a much safer way to do things. — Good.

Remarks

Is there any last thing you want to get out to people about this topic — something we haven't touched on yet that you feel is important? — Yeah, good question. No, I think you did a pretty good job of hitting it. Maybe the other thing I would just say is that biology is a really interesting field where you also have complex, self-organizing systems and emergent behavior, like we have in ML, and I've personally gotten a lot out of reading about the history of biology. I'd recommend that; there are a couple of really good books. One is The Eighth Day of Creation — it's kind of long, but very well written — and if people want a good non-fiction book, I'd highly recommend it. — Cool. Your blog is Bounded Regret, right? People can find you there. — Yep. — Excellent. Well, Jacob, thank you very much for being here; this was really cool. — Yeah, thank you. I'll see you around. — Yep, see you around.
