Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive.
In this first episode of the series, Burger is joined by Nicolò Fusi of Microsoft Research and Subutai Ahmad of Numenta to examine whether today’s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain’s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. The discussion probes what intelligence really means, where current models excel or fall short, and what future AI systems might need to bridge the gap.
The Shape of Things to Come podcast series: https://www.microsoft.com/en-us/resea...
Table of contents (14 segments)
Segment 1 (00:00 - 05:00)
At some point, I asked the model, "Can you describe how I think?" My jaw dropped because I said, "This thing knows me better than I know myself, but at the end of the day, it's a stochastic parrot, right? It's got the weights, and I give it a token and it outputs a token." So, like, are these machines intelligent or not?
— The good thing about transformers is that they created this nice overhang where they happened to be the right architecture at the right time, so that we could get to these amazing things.
— There are all sorts of random coincidences that we're exposed to on a day-to-day basis. Most of it's not necessary, and the stuff that actually is necessary will stay on, but we're constantly forming new connections and then we prune the stuff that we don't need. In an AI model, if you were to do that, it would just go, I don't know, it would go bananas. I would...
— Hi, this is Doug Burger. In today's podcast, I'm bringing on two expert AI researchers, and the question we're going to discuss is: are machines intelligent? We'll be debating the architecture of intelligence across digital implementations and biological implementations, because the answer to that question, I think, really will determine the shape of things to come. I'm delighted to have two legendary guests from the AI space joining me today: Nicolò Fusi, who is a research leader in Microsoft Research, and Subutai Ahmad of Numenta. Both of my guests today are very deep in AI, deep in the fundamentals, deep in the work. They're on the cutting edge, the frontier. And both of them have been a tremendous source of learning and knowledge for me. I'd like to ask each of my guests to introduce themselves: tell me a little bit about your background and what you're currently working on in AI, to the extent you can talk about it. So, Nicolò, would you please start?
— Thank you, Doug, for having us, and for having me here. It's so much fun. So, I'm Nicolò Fusi.
I'm a researcher at MSR, so Doug is my boss, and I will be very, very good to Doug in this podcast. No, but jokes aside, my own background is in Bayesian nonparametrics. That's what I started studying, so Gaussian processes and things like that. And then, equally, I would say, in computational biology, because I found it one of the most interesting use cases for AI techniques, and that has been true throughout my career. And pretty much like everybody else, eventually I moved away from kernel methods and Bayesian nonparametrics and started working more on language models, transformer models, with a particular eye towards information theory and the connection between information theory and generative modeling. That's one of the main things I do today, other than managing the research of people who do much more interesting work than I do.
— I have to interject there, Nicolò, because you dragged a piece of bait across my path. In Microsoft Research I have a management rule that I can't tell anyone what to do, because we hire some of the best people in the world. You have to trust them, and everyone is always completely free to call BS on me. So Nicolò was joking there. He does not have to toe the party line. In fact, I encourage him not to.
— I just have to be well behaved. That's the only thing I will say.
— Yeah. Thank you for baiting me, because he knew exactly what he was doing, and I love him for it. Subutai, can you tell us a little bit about yourself?
— Sure. Thank you so much, Doug, for having me. I'm really looking forward to the conversation between us all. I see myself fundamentally as a computer scientist. I've been studying computer science for longer than I care to admit. But something changed for me during my undergrad years.
I decided to minor in cognitive psychology, and I started to get really interested in how the brain works. To me, understanding intelligence and implementing intelligence was the hardest problem a computer scientist could ever solve, so I got very interested in that. But I couldn't see how to really commercialize it, and I was very interested in making products, so I stopped working on it for a while. I did a number of startups doing computer vision, video processing, a lot of that stuff. And then when Jeff Hawkins started Numenta back in 2005 with the idea of really deeply understanding how the brain works and figuring out how to apply that to AI, for me it was like all my worlds coming together. This is what I had to do. None of us thought it would take as long as it did.
Segment 2 (05:00 - 10:00)
We spent the last couple of decades really deeply trying to understand neuroscience from a computer scientist's, from a programmer's, standpoint: the underlying algorithms. That's really what I'm passionate about, just trying to translate what we understand about the neuroscience to today's AI. In terms of what we're working on today, and maybe we'll get into some of this, the brain is super efficient in how it works, power efficient, energy efficient, and we're trying to embody those ideas and make AI a lot more efficient than it is today.
— Great. I think we'll get into efficiency a little bit later in the podcast, because that's a subject that's near and dear to my heart, being a computer architect originally by training. I want to go back to one of the reasons I got involved with Numenta. Subutai and I have been exchanging emails, discussing collaborations, visiting each other through the years. And the thing that really stuck with me was when I read one of the earlier books from Jeff, On Intelligence. There was an example in the book about how the human brain learns continuously. I think biological organisms in general learn continuously. The anecdote that I remember was this: you're walking down the stairs to your basement, and there's one step that's always been a few inches off, and you decide to fix it. So you raise it so it's even with the others. Then the next time you go down the stairs, you don't remember, and you're wildly off. You hit that step earlier or later than you anticipated. You go off balance. You're flailing around. You get all this adrenaline. You think you're going to pitch headfirst down the stairs. Hopefully, you don't. And then the second time you do it, you're a little off balance, but it's not crazy.
And the third time, you maybe notice it a little bit. The fourth time, it's just your basement stairs. So somewhere between that first time down and the third and fourth times down, there are molecular changes in your brain that have learned the new timing of your basement steps. I remember that example vividly from the book, and it got me thinking, wow, this is so different from the way our digital AI works. I'll turn it over to you to comment on that, and then I think we'll go into the digital.
— Yeah, no, that's a great example. It's remarkable how our brain is constantly modeling our entire world at such a granular level, and we're not even aware of it perceptually. That example of the steps is probably something you wouldn't consciously be aware of. Yet if something is different about anything in your world that you're very familiar with, you'll instantly notice it, and then you'll update your world model. You'll adjust and you'll continue on. It's really remarkable how the brain's able to do that so seamlessly.
— And a lot of that is based on neurotransmitters, right? Because when you have that physical reaction to "I'm about to pitch down the stairs," you get a flood of transmitters that actually changes the way your brain learns, or at least the rate.
— Yeah. There's a flood of neurotransmitters, and neuromodulators as well, that invoke change, sometimes very rapidly. Another example: if you touch a hot stove, that's the canonical example, you will learn that very quickly. So there are a lot of chemical changes that happen. But it's also really interesting that we can update things and update our world knowledge without impacting everything else that we know. This is something that's very different, again, from today's AI models. We're able to make these changes in a very contextual and very fine-grained way.
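The staircase and hot-stove examples above describe learning that is driven by prediction error, that is faster when the surprise is larger, and that stays local to the one thing that changed. A minimal Python sketch of that idea follows. It is not a model of neuromodulation: the function `update_estimate`, the parameters `base_rate` and `surprise_gain`, and every number in it are assumptions made up for illustration.

```python
def update_estimate(estimate, observation, base_rate=0.2, surprise_gain=0.5):
    """One online update of a single predicted quantity.

    Hypothetical rule: the learning rate grows with the relative size of the
    prediction error, so a big surprise causes a big, fast correction.
    """
    error = observation - estimate
    relative_surprise = abs(error) / (abs(estimate) + 1e-9)
    rate = min(1.0, base_rate + surprise_gain * relative_surprise)
    return estimate + rate * error, error


# A toy "world model": independent predictions, in made-up units.
world = {"basement_step": 15.0, "kitchen_counter": 90.0}

# The step was rebuilt to 20.0; walk down the stairs four times.
errors = []
for trip in range(4):
    world["basement_step"], err = update_estimate(world["basement_step"], 20.0)
    errors.append(abs(err))

# The miss shrinks on every trip, and the unrelated prediction is untouched.
print([round(e, 2) for e in errors])
print(world["kitchen_counter"])
```

Only the entry that produced a prediction error is modified, which is the contrast being drawn with models whose knowledge is spread across shared weights.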
— So Nicolò, I want to go and talk a little bit now about transformers. You and I and Subutai were all working in the AI field many years before 2017, when the transformer hit. I was building, with my team, hardware to accelerate RNNs and LSTMs, which had this awful loop-carried dependence, you know, the bottleneck computation, and then the transformer was just much more parallelizable. So what do you think's really going on in these things? I know you and I have talked a lot about this. Maybe just start with the major blocks. You've got the attention layer, you've got the feed-forward layer, you've got the encoder stack and the decoder stack and the latent space in between. Can you walk us through those pieces at a high level and tell us what you think is going on?
— Yeah, I mean, I have a very opinionated view of why transformers are so great, so maybe I'll inject it. I don't know if it's a super novel, creative opinion, but it is an opinion. I guess the two main components you already described: the attention layers and the feed-forward layers. One way to think about them is: how does information
Segment 3 (10:00 - 15:00)
in your context relate to each other, and what is every token referring to, for instance in the case of transformers in language models? By context we mean the information you feed through the model, which the model keeps continuously generating and appending to. So, like your chat history,
— your prompt,
— so your chat history or your particular prompt in a chat session.
— That prompt, which is a sequence of words, gets discretized into a series of tokens. Tokens can be individual words, or can be multiple words connected together. The way we go from words to tokens typically is through an algorithm that tries to collapse multiple words as much as possible, so "the dog" may be just one token, as a first level of compression to fit into the model. It just tries to bring things together as efficiently as possible. Then within these models there is a transformer layer. This transformer layer, or this attention layer, sorry, tries to figure out what the term "the" refers to in "the dog," or, in "the dog jumps on the table," that "jumps" refers to "the dog." So there is this kind of mapping that happens in that layer. And then there are feed-forward layers, which in modern large language models store a lot of information. That's where the knowledge typically sits, the things that the model just knows. If you slam your arm against a cup of water on your table, that cup of water falls off the table. That's something that the model has baked in through reading a lot about cups falling off of tables when they're hit.
— You had two cups of water off camera. Did you just knock one over?
— This is a continuity issue in case we stitch this, and the audience...
— That's right. If they edit it, you'll see the water get higher and lower. Okay.
— Exactly.
And I'll just mess with the stitching by switching cups of water.
— I'm looking to see if I'm getting any stink eye from the team here. But no, they all look pretty happy. Okay.
— Anyway, so those are, for me, the two fundamental components. And the reason why I have an opinionated view is that, honestly, I do believe that RNNs, and even modern incarnations of state space models, are good enough to learn over this language data, or vision data, or audio data. The good thing about transformers is that they do two things very well. One is they get out of the way: they don't have this notion that everything has to be encoded through a state, like recurrent networks. And two, they do that very computationally efficiently, as you were saying. There isn't a computational bottleneck. And so they created this nice overhang where they happened to be the right architecture at the right time to unlock enough flow of information through the model that we could get to these amazing things.
— Let me press you on one thing. In the attention blocks you can figure out which tokens relate to which tokens. So I put in the prompt, and it's finding all the relations and then feeding those relations up to the feed-forward layer, well, the feed-forward unit within a layer, and you said that knowledge is encoded there. But then what does it really mean for those maps to access knowledge, and then you project it back into the output and feed it up to the attention block in the next layer?
— Okay.
— So it seems kind of weird that I'd be accessing knowledge and then taking that knowledge, merging it, and going back to another attention map.
— Well, you can see it as a mixing operation that happens in the feed-forward part of the layer.
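The attend-then-mix pattern being discussed here can be sketched in a few lines of plain Python. This is a deliberately stripped-down illustration, not the architecture of any particular model: it uses a single attention head, no layer normalization, random untrained weights, and tiny made-up dimensions, and the names `attend`, `feed_forward`, and `block` are assumptions introduced for this sketch.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def attend(tokens, wq, wk, wv):
    """Causal self-attention: each position mixes in values from itself and earlier positions."""
    qs = [matvec(wq, t) for t in tokens]
    ks = [matvec(wk, t) for t in tokens]
    vs = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(qs[0]))
    outputs, all_weights = [], []
    for i, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) / scale for j in range(i + 1)]
        weights = softmax(scores)  # how much position i attends to each earlier position
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) for d in range(len(vs[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


def feed_forward(vec, w1, w2):
    hidden = [max(0.0, h) for h in matvec(w1, vec)]  # ReLU nonlinearity
    return matvec(w2, hidden)


def block(tokens, p):
    """One layer: attend, then mix. Results are added to the input (residual connections)."""
    attended, weights = attend(tokens, p["wq"], p["wk"], p["wv"])
    tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    tokens = [[t + f for t, f in zip(tok, feed_forward(tok, p["w1"], p["w2"]))] for tok in tokens]
    return tokens, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden = 4, 8
layers = [
    {
        "wq": random_matrix(dim, dim, rng),
        "wk": random_matrix(dim, dim, rng),
        "wv": random_matrix(dim, dim, rng),
        "w1": random_matrix(hidden, dim, rng),
        "w2": random_matrix(dim, hidden, rng),
    }
    for _ in range(2)
]

# Three random vectors standing in for the embeddings of "the", "dog", "jumps".
states = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
for params in layers:
    states, weights = block(states, params)  # attend, mix, then repeat in the next layer
```

Because each layer adds its output to its input rather than replacing it, earlier information is carried forward and re-weighted instead of being overwritten.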
You know, you're attending, then you're mixing and kind of reprojecting to some space with higher information content, or a different level of information extraction, and then you're putting it back: okay, let me do another round of processing, and attending, and then I mix again, and then I do it again. I think that the information that is present in the prompt, and that has been baked into the weights, gets further and further refined. Whether that refinement is extraction of structure, or addition of information, or aggregation into higher-level concepts, I'm not sure. I think it's just that structure gets extracted and things that are irrelevant get pushed away, but that doesn't necessarily mean that it gets aggregated through the architecture.
— So now I'm going to try to restate what I think I hear you saying. We're adding information, and we're adding information at a higher level, but not necessarily throwing away the low-level information, at least not the part that's relevant, right? Because if the higher-level stuff depends on the low-level stuff, I have to have that first. And so then you get to the top of
Segment 4 (15:00 - 20:00)
the encoder block and you're in the latent space with all of that information kind of maximized. Is that a way to think about it? And if you agree, can you talk about what the encoder block really is and what the latent space is?
— I tend to agree, yes. I think you're describing what I think is happening, which is: given the context in your prompt, and given the task that the model perceives or figures out that you're doing, it has to highlight and pull out the relevant information. And it does that not by summarizing layer by layer, but by increasing the prominence of that information and suppressing other things. I think that's ultimately what happens, up to the point where you reach this beautiful point in concept space which identifies both your intent and the things in the prompt and in the knowledge of the model that are necessary to solve it.
— And so one last question, and then I want to go to Subutai for a sec. Now when we go through the decoder stack, are we just going the other way and stripping out the high-level concepts early and then getting down to the granular tokens? Because you go up through the encoder stack, those attention blocks and feed-forward layers, to get to that magical latent space, and now we're going to go the other direction. How do you think about that other direction through the decoder stack, which is the same primitives as the encoder stack?
— The same primitives. You can think of it as kind of the reverse operation. You never lost information throughout. You just suppress or privilege different kinds of information, and now you're basically just projecting it back out to a space that is, you know, intelligible.
And it's kind of where the model gets its... I hesitate to use the term reward, because it has a particular implication, but that's where the loss gets computed and then gets pushed back through the model.
— Right, as you're trying to evolve and train all those parameters: the relationships between words, the information in the feed-forward layers, the design of that latent space, and the extraction of the knowledge from it.
— That's right. And so in an encoder-decoder model, you push through the whole thing. You decode back to a particular token, which, for people who don't know, is literally a number out of a vocabulary, like word number 487. And if it was word number 1,500, you get, you know, like a bad reward.
— Yeah. And then...
— And if you got it right, you get a positive signal that then just flows back through the model.
— I'd like to go over to Subutai now. So after hearing this: you've studied neuroscience and the neocortex and cortical columns and all of this for a long time, and you and I have had lots of debates. Is the human brain doing something different than that? Are we just building latent spaces and then extracting? The architecture is very different, but what's going on under the hood?
— Yeah, the architecture is very different. As Nicolò was describing what happens throughout a transformer stack, I was trying to relate it to what we know in the brain as well. In a typical transformer model, at the end of the day, there is a single latent space from which the next token is output. That does not happen in the brain. There are thousands and thousands of latent spaces that are sort of collaborating together, if you will. A lot of what we publish is under the moniker the Thousand Brains Theory of Intelligence, and Jeff published a book a few years ago on that.
And that dates back to discoveries in neuroscience from the '60s and '70s by the neuroscientist Vernon Mountcastle, who was a professor at Johns Hopkins. He made this remarkable discovery that our neocortex, which is the biggest part of our brain, and that's where all intelligent function happens, is actually composed of roughly 100,000 of what are called cortical columns.
— Right.
— And each cortical column is maybe 50,000 neurons, and there's a very complex microcircuit and microarchitecture between the neurons in a cortical column, but then there are 100,000 of them. And every part of your brain, whether it's doing visual processing, auditory processing, language, thought, or motor actions, is composed of essentially the same microarchitecture. This was a remarkable discovery. It says that there's a universal architecture. It's not a simple one. It's complex. But it's repeated throughout the brain. And that's where this idea of the thousand brains comes from. Each of these cortical columns
Segment 5 (20:00 - 25:00)
is actually a complete sensorimotor processing system. It has inputs, it has outputs, it's getting sensory input, it's sending outputs to motor systems. And, in our theory, it's building complete world models. So there isn't a single latent space. There are thousands of these latent spaces, and each little cortical column is trying to understand its little bit of the world. One cortical column might be getting, at the lowest level, maybe one degree of visual information from the top right-hand corner of your retina. Another one might be focusing on specific frequencies in the auditory range. Each one has its own little view of the world and is building its own little world model, and then they all collaborate together. There's no top or bottom here. There's no homunculus in the brain. Everything is sort of equal. And they're all simultaneously collaborating and voting and coming to a consistent interpretation of all of these sensory inputs that we're getting: what is the single consistent concept, if you will? And based on that, they make the motor actions that are most relevant. So it's a sensorimotor loop. It's a constantly recurring system. We're constantly making predictions. As we discussed earlier, we are constantly learning. Every cortical column is constantly updating its connections, constantly updating its weights. It's building and incrementally improving its world model constantly. So it's a massively distributed set of processing elements that we call cortical columns, all equal, operating in parallel. I think there are similarities between them, for sure, but at least the way I described it, I think it's very different in its operation from what I understand today's LLMs to be. I don't know if you agree with that or not.
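The "many small models that vote" idea described above can be illustrated with a toy Python sketch. To be clear, this is not Numenta's algorithm or any published implementation: the objects, the feature sets in `OBJECT_FEATURES`, and the functions `column_vote` and `consensus` are hypothetical, chosen only to show how many narrow, ambiguous views can settle on one consistent interpretation.

```python
from collections import Counter

# Hypothetical toy "world models": the local features each object can produce.
OBJECT_FEATURES = {
    "coffee cup": {"curved", "smooth", "handle", "warm"},
    "soda can": {"curved", "smooth", "cold", "rim"},
    "book": {"flat", "edge", "smooth"},
}


def column_vote(sensed_feature):
    """One column: every object that is consistent with its own small input."""
    return {name for name, feats in OBJECT_FEATURES.items() if sensed_feature in feats}


def consensus(sensed_features):
    """All columns vote; the interpretation with the most support wins."""
    tally = Counter()
    for feature in sensed_features:
        tally.update(column_vote(feature))
    winner, _ = tally.most_common(1)[0]
    return winner, dict(tally)


# Four columns, each sensing a different patch of the same object.
winner, tally = consensus(["curved", "smooth", "handle", "warm"])
print(winner)
print(tally)
```

No single column can identify the object from "smooth" or "curved" alone; the agreement across columns is what resolves the ambiguity. A real system would also run the columns in parallel and let them learn, which this sketch leaves out.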
— To better understand, I had a question: are these cortical columns relying on the fact that these are essentially multiple views of the same process, and are those multiple views, the parts of the sensory input that get allocated or subdivided, happening at the same time point? So in other words, if you could artificially... Oh, sorry. Go ahead.
— No, finish the thought.
— If you could artificially delay, by some time t, some cortical columns with respect to the rest, would the learning suffer? In other words, how important is it that it's all on the same schedule?
— Yeah, I mean, that's another difference. With LLMs today, you get your input, one layer processes it, then the next, and the other layers are not operating. In the brain, it's not like that. Everything is operating in parallel, asynchronously, and this is important. They're constantly trying to make predictions and so on. So if you were to artificially slow down some of your cortical columns, you would absolutely suffer. Your thinking would...
— I wanted to interject here, just because this discussion is where I got super interested in the difference and then spent a bunch of time with Subutai to learn from him. If I think about my skin, which is an organ, as I understand it there's a cortical column attached to each patch of my skin, and the size of that patch corresponds to the nerve density there. So in my brain there is a set of cortical columns that are skin sensors,
— and if I numbered all the cortical columns in the brain, I could draw a map on my skin and say this is number 72 in this patch, this is number 73 in this patch. Now, are human cortical columns better than, say, what we see in a mouse? And of course this is a leading question, because I know the answer.
— Yeah.
So yes, cortical columns in your primary sensory areas each get input from some patch of your skin somewhere on your body, and there are many more cortical columns associated with your fingertips than with a square centimeter of your back, for example. So there are definitely areas of sensory information that we pay a lot more attention to and devote a lot more physical resources to.
— In terms of a mouse and humans, it's pretty remarkable: all mammals have a neocortex, and all mammals have cortical columns, from a mouse all the way up to humans. And mice have cortical columns that are very similar to what a human has. It's not identical. There are differences. But by and large, the architecture of a cortical column in a mouse is very similar to cortical columns in
Segment 6 (25:00 - 30:00)
humans. Human cortical columns are bigger, there are more neurons, and there's more detail there, but essentially it's the same.
— Maybe just scaled up a little bit.
— Yeah. So evolution basically discovered this structure that's really excellent for processing information and dealing with it, and then, very fast in evolutionary time, basically figured out that if you could scale up the number of cortical columns, you get more intelligent animals. And that's what happened, very fast evolutionarily.
— I didn't know about the unevenness in where cortical columns are present. I'm not a neuroscientist. And this is interesting, because one of the biggest frustrations with many modern model architectures is that they deploy a constant amount of computation no matter what the input is. I go through the same number of layers whether I'm trying to predict the word "dog" after "the," or whether I'm trying to give the final answer to a very complicated math question, or say whether a theorem was proven or not in the prompt. And so that's interesting, because some current instantiations of modern architectures actually try to cluster things together such that you have a constant amount of information that you then push together through the model. So maybe on my fingertips I need more processing than I need on my elbow, and so this kind of makes sense.
— Nicolò is being humble. He was working on this problem two years ago and told me about it, and it was one of the things I learned from you that made me think differently.
— I just like to refer to people working on this.
— Yes. Random average people who are not all necessarily brilliant AI scientists. So the prediction part of this, though, is really what's fascinating to me, because again, it's something else Subutai and I discussed many years ago.
You know, if I'm moving my finger towards the table, and I guess I'm now looking to see if the viewership can see it. Yeah. My brain is making predictions, because I have a world model. It knows a table is there. And the cortical columns representing that patch of skin, as it's getting closer, are starting to predict that I'm going to feel something that feels like the table, and oh, there, I hit it. Prediction met. But if I touched it and it felt really icy cold, or super hot, or fluffy, or it wasn't there and I passed through it, I'd get a flurry of activity because the prediction wouldn't match the world model, and that's where learning would happen. Subutai, does that sound like the right model and intuition?
— Yeah, that's definitely a very important component of it. We're constantly making predictions, and as you said, you're moving your right hand, your right fingertip, down. Perhaps you've never sat in this room before or seen this table before. You would still have a prediction, a very good prediction,
— because you know what a table is.
— And if it was different, you would notice it right away. But if your left hand, which you weren't paying attention to, also felt icy cold, then you would notice that as well. So you're actually making not just one prediction, you're making thousands and thousands of predictions constantly. Every cortical column is making predictions, and if something were anomalous, highly anomalous, you would notice it. So this is something we don't often realize: we're making very granular predictions constantly, and when things are wrong, we do learn from it. And there's another interesting thing, and this is again possibly different from how LLMs work.
If I were to tell you to touch the bottom surface of the table, you could, again without looking at the table or opening your eyes, move your finger in and touch the bottom of the table, because you have a set of reference frames that relate to... There you go. Yep. You're able to do it.
— I did it. Yeah. Amazing.
— Even though you've maybe never been in this room, maybe never seen this table before, it doesn't matter.
— I've been in this room, because we had to prep for the podcast series, but I didn't touch the other side of the table. That's for sure.
— Yeah, exactly. So we know where things are in relation to each other, where our body is in relation to everything, and we can very rapidly learn. And again, if the bottom part of the table was anomalous, you would notice it and potentially remember that.
— I'm not going to lie, I was expecting you to find something under that table, like on a talk show,
— or chewing gum. "If you reach under the table, you're going to find a copy of my paper."
— You know, if I was smarter and better prepared, that's exactly what would have happened. But sorry, guys.
— I think you told me something, and I'll give a little bit of
Segment 7 (30:00 - 35:00)
preamble. So the brain has these dendritic networks in each neuron, and they form synapses. A neuron fires, and the axon of the neuron that's firing will propagate a signal through the synapses, which might do a little signal processing, to the dendrites of the downstream neurons, and those dendrites can then prime the neuron to fire. That's one of the fundamental mechanisms, and it's the formation of those synapses between the upstream neurons and the dendrites of the downstream neurons that seems to be the basis of learning.
— And to me that feels a little bit like an attention map. So maybe the dendritic network is doing something akin to self-attention, and we have some work going on in that direction in MSR. But the thing you told me was that your brain is actually forming an incredibly large number of synapses speculatively, in some sense sampling the world,
— when something happens, in case it will recur. Maybe it's a version of Hebbian learning, right? You know, things that fire together wire together.
— Exactly. But then if that pattern doesn't recur, they get pruned. And what is the fraction of your synapses that get turned over every three or four days? You know, ballpark.
— Okay. Yeah, I remember this. This is an absolutely mind-blowing study in neuroscience. The way a lot of learning happens in the brain is by adding and dropping connections. In AI models, it's usually strengthening, you know, taking a high-precision floating-point number and making it higher or lower, but you're not adding and dropping connections. The connections are always there. In fact, everything is
— fully connected, right, between layers.
— And so in the brain, you're always adding and dropping connections. That's a fundamental mechanism by which we learn.
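The contrast drawn above, learning by adding and dropping connections rather than only nudging fixed weights, can be sketched in Python as follows. This is a toy under stated assumptions, not a model from the study being discussed: the integer "strength" counter, the rule in `plasticity_step`, and the activity pattern are all invented for illustration.

```python
def plasticity_step(strengths, coactive_pairs):
    """Return connection strengths after one period of activity.

    Hypothetical rule: any pair that fired together gains one unit of strength
    (a new connection is created if none existed); every other existing
    connection loses one unit and is pruned when it reaches zero.
    """
    updated = {}
    for pair in set(strengths) | set(coactive_pairs):
        change = 1 if pair in coactive_pairs else -1
        strength = strengths.get(pair, 0) + change
        if strength > 0:
            updated[pair] = strength
    return updated


recurring = {(0, 1), (1, 2)}  # a pattern that shows up in every period
strengths = {}
for period in range(10):
    # One made-up coincidence per period, between otherwise unrelated units.
    coincidence = {(period % 8, (period + 3) % 8)}
    strengths = plasticity_step(strengths, recurring | coincidence)

print(strengths)
```

Connections formed from one-off coincidences stay weak and disappear within a period or two, while the connections for the recurring pattern keep growing stronger, so high turnover among weak connections can coexist with stable long-term structure.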
It's one of the fundamental mechanisms, and what I've read in a study is that they looked at adult mice and adult animals, and they would look at the number of synapses that were connected over the course of a couple of months, and they were able to trace individual synapses in this particular part of the brain. And what they found is that every four days, 30% of the synapses that were there were no longer there 4 days later, and there was a new 30%. So there's a huge number of connections that are constantly being added and constantly being pruned. And my theory of what's going on there is that we're always speculatively trying to learn things. So, you know, there's all sorts of random coincidences and things that we are exposed to on a day-to-day basis. We're constantly forming connections there because we don't know what's actually going to be required and what's real and what's random. Most of it's not necessary. And the stuff that actually is necessary will stay on. But we're constantly trying to learn. This is a part of continuous learning that's often not appreciated, I think: we're constantly forming new connections and then we prune the stuff that we don't need. In an AI model, if you were to do that, it would just go, I don't know, it would go bananas. I would. — Well, so let's double click on that. So when you told me that the way — it's mind-blowing, this 30%. Like your brain is going to be totally different a few days from now. — Yeah. It's so mind-blowing, and when you told me that I spent some time processing it. So a whole bunch of synapses were created and destroyed during that time. But it just made me think that we have all of these columns getting all of this input continuously. You know, eyes, hearing, smell, taste, skin, heat, and then, you know, interactions with people and then planning and experiences, just at every level.
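The form-then-prune idea described above can be sketched as a toy program. This is only an illustration under loud assumptions: the function name `form_and_prune`, the `keep_threshold` parameter, and the example events are all hypothetical, and real synaptic turnover is far richer than counting repeats. The sketch forms a weak "connection" for every coincidence it sees, then keeps only those that recurred.

```python
from collections import Counter
from typing import Hashable, Iterable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def form_and_prune(coincidences: Iterable[Pair], keep_threshold: int) -> Tuple[Set[Pair], Set[Pair]]:
    """Toy model: every observed coincidence speculatively forms a weak connection.

    Connections seen at least `keep_threshold` times are kept;
    the rest are pruned. (Hypothetical rule, for illustration only.)
    """
    counts = Counter(coincidences)  # one tentative connection per distinct pair
    kept = {pair for pair, n in counts.items() if n >= keep_threshold}
    pruned = set(counts) - kept
    return kept, pruned


# Made-up observations: one pattern recurs, two are random coincidences.
observed = [
    ("stove", "pain"),
    ("doorbell", "rain"),
    ("stove", "pain"),
    ("red car", "sneeze"),
    ("stove", "pain"),
]
kept, pruned = form_and_prune(observed, keep_threshold=2)
print("kept:", kept)
print("pruned:", pruned)
```

Here only the recurring pair survives, while the one-off coincidences are dropped, which mirrors the "most of it's not necessary" point in the conversation.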
And they're constantly sampling all this noise coming in and basically filtering out the noise. It's kind of like a low-pass filter. But when something statistically significant recurs, it's going to lock and then become persistent. — Yeah. I think so. I think we're, you know, there's so much that's happening, we're constantly learning, and when you touch a hot stove or something, there's a flood of dopamine specific to those areas that causes these synapses to strengthen very quickly. You know, most of these synapses that are learned are very weak synapses. And in this study they also quantified the turnover in strong synapses versus weak synapses, and it's comforting to know that the strong
Segment 8 (35:00 - 40:00)
synapses stay there. It's really these weak synapses that are constantly added and dropped, and then some of them will become strong. — Now I want to return to Nicolo, but with an observation. So when I'm training a transformer, it's also a prediction-based system. You know, I have my input in the training set, I have my masked token or the next token I'm trying to predict, I run it through, I look at how successfully it made that prediction, and the worse it was, the steeper the error that I drive back through the network. So, you know, if it's spot-on, I don't learn very much. But if the prediction is way off, I've got to change a bunch of stuff. That sounds analogous to what Subutai was just describing in the cortical columns. — No, that's right. I mean, I have one big pet peeve, around pre-training in particular. Okay, — again for context, language models in particular, but you know, many other instantiations of large models, are trained in a few phases. Usually one of them is pre-training, where you have some ground-truth text and you remove, let's say, just the last word, and then you ask the model to predict the last word, and that's when you get that loss: do you get the word right or wrong? — One of the big problems that I have is that, you know, in human experience, we do not get feedback on every single thought. The problem with language models, the way we're training them, at least in pre-training, is that they do that thing called teacher forcing. So they guess a word, then they immediately get the signal, and then the right word gets filled in and then they predict the next one. So when you go through a passage of text, you constantly get this reward. And it's such a bizarre way to train a model.
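Teacher forcing as described above can be sketched with a deliberately tiny stand-in for a language model. Everything named here is an assumption for illustration: `toy_model` is a hypothetical bigram lookup that conditions only on the previous token (a real language model conditions on the whole prefix), and `teacher_forced` and `free_running` are made-up helper names. The contrast is that teacher forcing scores every position and always feeds in the true token next, while a free-running rollout feeds the model its own guesses.

```python
from typing import Callable, List


def teacher_forced(model: Callable[[str], str], tokens: List[str]) -> List[bool]:
    """Score every position; the ground-truth token is always fed in next."""
    hits = []
    for i in range(len(tokens) - 1):
        guess = model(tokens[i])             # input is the true token, not a past guess
        hits.append(guess == tokens[i + 1])  # feedback arrives at every single step
    return hits


def free_running(model: Callable[[str], str], first_token: str, steps: int) -> List[str]:
    """Roll the model forward on its own outputs, with no correction."""
    out, current = [], first_token
    for _ in range(steps):
        current = model(current)
        out.append(current)
    return out


# Hypothetical bigram table standing in for a trained model.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "down", "dog": "ran"}


def toy_model(token: str) -> str:
    return BIGRAMS.get(token, "<unk>")


passage = ["the", "dog", "sat", "down"]
feedback = teacher_forced(toy_model, passage)
rollout = free_running(toy_model, "the", 3)
print(feedback)  # one right/wrong signal per predicted token
print(rollout)   # what the model says when left to its own guesses
```

In the teacher-forced pass the model is wrong twice, yet it still gets to predict from the correct token each time; that per-token stream of supervision is the property the speaker calls bizarre but computationally necessary.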
It's necessary because you want a lot of supervision to essentially use all the computation available, but at the same time it actually makes the models arguably a little bit worse than what they would be if you had enough compute to train them without this. I went on a tangent just because it's a pet peeve. — It's a really important point though, because your goal when you're training a model is to get to your loss target with the minimal cost and time, or of course a fixed budget and the lowest loss target. But you know, for biological systems, their goal is also survival with energy minimization. And so once you've built a world model that works, right? Like touching the table, touching the underside of the table. Nope, still nothing exciting there. It takes very little energy to do that. And I think a tragedy is that we all have these supercomputers in our heads. You know, the neocortex is, what, about 10 watts, and it's this amazing thing, right, that can compose symphonies. But once we have a world model, a lot of us just stop learning because it's comfortable, right? You don't have to perturb the state. I mean, how many of us go through every day and all of our predictions succeed and there are no surprises, you know, so all the new synapses get swept away? That's not a goal of pre-training, because then you're just wasting energy, but we're trying to minimize energy consumption, so it does feel kind of aligned to me in some sense. So I've got a straw man I want to hit you with, but before we do, Nicolo, I want you to talk about your view on compression, like LLMs as compressors, because I know this is something you're very passionate about and opinionated about, and I've learned a lot from you on this, too. And then Subutai, after this, I'd like to hear your biological response.
I mean, your response from a biological perspective — Yeah, that's right. Of course. — And then I want to throw out this hybrid straw man. So Nicolo, tell us about compression. — The view is that basically generative models are compressors, in an information-theoretic sense. — And so trying to come up with a better generative model is equivalent to trying to find the best compressor for some data. — And now when you say compressor, do you mean lossless or lossy? — I mean lossless. You can basically look at literally my much-maligned objective function that you use for pre-training, which is, you know, next-token prediction, and you can basically draw a complete parallel to what you would do if you were trying to do compression, which is coming up with the shortest possible code for something that you're trying to compress. And so the two things are the same, and it kind of fits into a broader picture that, you know, like
Segment 9 (40:00 - 45:00)
goes back to Occam's razor and Kolmogorov complexity and Solomonoff induction, which is: you want short descriptions for likely things that happen in the world, and you want your algorithm that produces those short descriptions to be also short. That's the minimum description length principle. And I do feel like it also fits in with what you were saying about the concept of, you have a good world model, why look for surprise? Because it affects both terms simultaneously, both the algorithm, like your own world model, but also the loss that you incur when something unexpected happens. And so if I'm an agent in the world trying to minimize the minimum description length of the world, I like to go and seek some in-distribution data such that I don't bump up my surprise term too much. — Right. And I think you said at some point that, you know, when I'm training a model, even though I get to the same loss point between model A and model B, if I have a steeper loss curve in model A than model B, you know, it's getting to a better sort of compressed base vocabulary faster, which makes it more general. The shape of that curve matters from a compression perspective. — Yeah. I mean, I think it would help here to expand on what I was talking about in terms of the minimum description length principle. In the minimum description length principle, one component is basically the loss of the model you're training. And so it's a sum over the mistakes you make at predicting, you know, each word. That's one term. And the other term is how long it takes you in code to describe the model and the training procedure to produce that training curve. — So yes, if you look at it collectively, one term is kind of fixed. It's a cost. It's the amount of code it will take you to write out a language model, for instance, in code, like you literally implement it. Not the weights.
Just implement the initialization of it and then the training loop. And then on the other side you have this training loss that gets generated as you start observing data. And of course, because it's a sum, you want to minimize really the area, like you want to minimize the sum. And so a flatter curve is much better than a steeper curve, even if it ends up at the end to be slightly better. — Yeah, concave is better than convex — among other things. — Sorry. So you know, I think that we could do a whole episode on this compression view because it's really fascinating, and the lossless part of it is what blew my mind. And I think, you know, I'm guessing there are multiple camps here and you're squarely in one camp. So, Subutai, you know, can I think of cortical columns as compressors? — Yeah, it's a good question. You know, there's so much in the compression literature that you can draw insight from. If you look at the representations in cortical columns that populations of neurons have, you know, one of the things you have to deal with is that the brain doesn't have a huge nuclear power plant attached to it. You know, we only have 12 watts or so to process everything we want to do. And the representations that evolution has discovered are incredibly sparse. What that means is that you may have thousands and thousands of neurons in a layer, but only about 1% of them will actually be active at a time. And so it's a very small subset of neurons that are actually active. I don't know about this minimum description length, whether that applies. I can say a couple of things about that. You know, by and large the representations are very sparse when you're predicting well. When you see a surprise, there's a burst of activity. When there's something that's unusual, there are a lot more neurons that fire. And — that's why learning is tiring.
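The two-term accounting described earlier in this segment (a fixed cost for the code that defines the model and training loop, plus the summed prediction loss as data arrives) can be written down as a short worked example. This is a sketch under stated assumptions: the function name `description_length_bits`, the 100-bit code cost, and both probability sequences are invented for illustration, and the loss is measured in bits as the negative base-2 log of the probability the model assigned to each true token.

```python
import math
from typing import Sequence


def description_length_bits(model_code_bits: float, probs_of_truth: Sequence[float]) -> float:
    """Two-part cost: bits to write down the model code, plus bits to encode the data.

    Each observed token costs -log2(p), where p is the probability the model
    assigned to that token just before seeing it.
    """
    data_bits = sum(-math.log2(p) for p in probs_of_truth)
    return model_code_bits + data_bits


# Two hypothetical learners with the same code size and the same final accuracy.
# The first improves early; the second improves late.
early_learner = [0.25, 0.5, 0.5, 0.5]
late_learner = [0.25, 0.25, 0.25, 0.5]

early_cost = description_length_bits(100, early_learner)
late_cost = description_length_bits(100, late_learner)
print(early_cost, late_cost)
```

Both learners end at the same per-token loss, but the total differs because the sum covers the whole curve, not just its end point. That is the sense in which the area under the training curve matters in this view.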
— Exactly. No, that's right. And so what we think is happening is that, you know, the actual representation of something is a very small number of neurons. When you're surprised, there may be many things that are consistent with that surprise. And so your brain represents a union of all of those things at once. — And when you have a very sparse representation, you can actually have a union of many different things without getting confused. So that's what we think is going on there. So it is a very compressed, very efficient representation. And because it's such a small percentage of neurons that are firing, we are very parsimonious in how we represent things and extremely energy-efficient metabolically. — I wanted to get to the efficiency point
Segment 10 (45:00 - 50:00)
but before I do — you know, you talk about this 1 to 2% of the neurons firing, but the brain is actually much sparser than that at a fine grain, right? Because you have 1% of the neurons firing, but they aren't connected to all the other neurons in the region — you know, so really the sparsity should be the product of the connectivity fraction times the activity factor. — Yeah. Right. — And that's about one out of 10,000, something like that. — Exactly. Yeah. So something like maybe 1% of the neurons are firing at any point in time, and maybe 1% of the connections that are possible are actually there at any point in time. So it's a very small — you know, sub-network through this massive network that's actually being activated. A tiny percentage of neurons going through a very tiny piece of the full network. You know, some people say, "Oh, we're only using 1% of our brain." That's not true. It just means at any point in time you're only using 1%. But at other points in time, a different 1% is being used. So, you know, the activity does move around quite a bit, but at any point in time, it's extremely small. — Okay. The sparsity, I think, you know, the representation, how the brain is doing this compression biologically is super fascinating. And I want to go on a little bit of a detour now to efficiency. So I remember in 2017, when in MSR we were building, you know, hardware acceleration for RNNs, and then the transformer hit, and they were optimized, you know, to be highly parallelizable across this quadratic attention map for GPUs.
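The multiplication discussed at the start of this segment is simple enough to check directly. The sketch below only restates the speakers' ballpark figures (about 1% of neurons active, about 1% of possible connections present); the function name `effective_sparsity` is a made-up label, and treating the two fractions as independent is an assumption.

```python
def effective_sparsity(active_fraction: float, connectivity_fraction: float) -> float:
    """Fraction of the full potential network in use at one moment,
    assuming activity and connectivity are independent."""
    return active_fraction * connectivity_fraction


fraction = effective_sparsity(0.01, 0.01)  # ~1% active, ~1% connected
one_in = round(1 / fraction)
print(f"about 1 in {one_in} potential connections is in use at a time")
```

With both fractions at 1%, the product matches the "one out of 10,000" figure quoted in the conversation.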
The way I would describe it is that that transition to semi-supervised training moved us from an era where we were really data limited, like you had to have good, high-quality labeled data, to one where you were compute limited. And when that transition happened, we hockey-sticked from "I'm building faster machines but I'm limited by data" to "the bigger a machine I can build, as long as I have enough unlabeled data of high quality, the better I can do with a model." And so we went on this supercomputing arms race, and now we're building these just gargantuan machines. And really we've kind of been brute-forcing it. I mean, we've done a lot of things to optimize, like quantization, you know, better process nodes, more efficient tensor unit design, but to first order we've been training bigger models by building bigger systems. And I just wonder, do you think that the brain, at this 10 to 12 watts in the neocortex, just has a fundamentally more efficient learning mechanism, or do we think that what we're doing in transformers in the most advanced silicon is as efficient, and we're just building much larger, more capable models? — Oh, I think without a doubt transformers are extremely inefficient and very brute force. We touched on this a little bit earlier with the attention mechanism, where transformers are essentially comparing every token to every other token. I mean, there are architectures which reduce that for sure, but it's essentially an n-squared operation and we're doing this at every layer. I mean, there's nothing like that in the brain. You know, in some sense the context for the very next word I'm about to say is my entire life, right? And the amount of time I take to say the next word doesn't depend on the length of that context at all. It's a constant-time dependence on context. So it's a significant, you know, reduction in the compute that's required.
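The "n-squared at every layer" remark can be made concrete with a rough count. This is a back-of-the-envelope sketch, not a model of any real system: the function name `attention_comparisons` and the layer count are hypothetical, and the count ignores attention heads, causal masking, and the reduced-cost architectures the speaker mentions.

```python
def attention_comparisons(n_tokens: int, n_layers: int) -> int:
    """Token-to-token comparisons for full self-attention:
    every token against every token, repeated at every layer."""
    return n_layers * n_tokens * n_tokens


layers = 32  # hypothetical depth, chosen only for illustration
short = attention_comparisons(1_000, layers)
long = attention_comparisons(2_000, layers)
print(short, long, long // short)
```

Doubling the context length quadruples the comparison count, whereas the speaker's point about the brain is that the time to produce the next word does not grow with the length of the context.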
— You can kind of think about it like, the brain I think has somewhere around maybe 70 trillion synapses. When I say the brain, I mean the neocortex has about 70 trillion synapses, and it's using only 12 watts. And a synapse is roughly equivalent to a parameter. And if you were to take the most efficient GPUs today and try to run a 70-trillion-parameter model, it would be something like a megawatt of power. It's orders of magnitude more inefficient than what our brain is doing. — So I believe — the metric I used, to go back to your point, you know, this is something I think we talked about back in the day, right? After this kicked off, for a few years we were trying to project how far this would go under the current model, to inform the research and the directions you took, which is why I got so interested in sparsity and working with you. And we would look at a training run and just say, how many joules did it take to train the whole model, how many parameters do we have, and sort of what's our parameters per joule, and if
Segment 11 (50:00 - 55:00)
by that metric, you know, we were off by many orders of magnitude from where the brain is, but I don't know that that's the right metric. So any thoughts on that? — Yeah, I mean, in some ways, you know, transformers embody more knowledge in them than any human has. It has memorized, you know, the entire internet's worth of knowledge — all scientific papers, you know, good and bad, whatever. You know, it's memorized everything. So that's something that humans just cannot do. So there's definitely stuff that's better in transformers than humans, but fundamentally I think, you know, we're extremely efficient in how we process the next token or the next bit of information that's coming in. And I think there's a lot we can learn from the brain and apply to LLMs and future AI models there. — I was going to ask a question related to that, because forget memorizing the internet, but let me give you another example that transformers do really well, and I'm wondering about the human aspect of this, or the brain aspect of this. Because transformers, because of the n-squared computation, are really good at stuff like needle in a haystack. So right now I can talk to you and I can tell you the password is something silly like "podcast microphone blue," whatever, that's a password, and then I can proceed to read the entire Odyssey or a bunch of other books to you out loud for the next five or six hours, and then I can ask the transformer what the password was, and the transformer will do this nice n-squared computation many times and will spit out the password. For a human, you know, there will be a decay of that password and then at some point they won't remember, and depending on the human it may be at the first chapter of the Odyssey or at the end, but fundamentally the type of computation that is done is very different.
So it always makes me wonder about the efficiency, because it's a different type of computation. Efficiency is kind of like what you're doing divided by how good you are at doing it, and so when the things we're doing are so incompatible in many ways, that always troubles me a little bit. — I don't know if there's a question in there — Yeah, I mean, transformers can do stuff that humans find very difficult to do, absolutely. You know, maybe there's a way to get the best of both, I don't know. I don't know that it's fundamentally necessary to have such brute-force computation to get all of these features. — That's right. — Yeah. It is a weird thing, because you know, this is why memory palaces work so well. Like, there is a way, though, for a human to remember that my microphone is gray. It's not actually blue, Nicolo. — Mine is blue. You don't see it. It's off camera. You see your — camera. Yeah, I know. I was just teasing you. But there's a way, like if I can just connect it to enough things, get that connectivity graph, then I'll remember it, because it's captured the signal out of the noise and connected it to enough things that I can retrieve it. Oh, and retrieval will be a whole other topic we don't have time to get into today. But now I want to go to the straw man. So, let's take continual learning off the table. Let's imagine that as I go through my day, I'm just saving all of the sensory data to put in my training set. And now imagine that I take 100,000 little transformer blocks and I'm training them each with what they're seeing. Okay, I replay the day.
So again I don't have to worry about continuous learning, and whatever cross-cortical-column routing of the outputs to the inputs, and, Subutai, we've talked about this, there's a complex set of wiring there to bring features from here to there that gets learned — if I replicated that, could a transformer block — kind of do what the cortical columns are doing? Like, could I just instrument all my sensory patches with little transformer blocks and then wire them up in the right way and have it work? — I think there's still a couple of things we need. One is that cortical columns are fundamentally sensory-motor, and so each cortical column is actually initiating actions as well. So you fundamentally cannot have a static data set ahead of time. It's always dynamic, because we're constantly making movements to get the next bit of data. And so — couldn't I tokenize that though? — I mean, you could tokenize the input and you can tokenize the output, but you know, if you were to play that same set of inputs back again to a network, a cortical column that's randomly wired differently, it may make a different set of actions. — And so as soon as it makes the first action that's different, that data set is no longer valid. Right? It's um you
Segment 12 (55:00 - 60:00)
know, fundamentally you have to have a simulation of an environment rather than a static one-way data set, if that makes sense. So I think that's one piece that's missing in transformers today, this sort of sensory-motor loop. And then the other piece we talked about is continuous learning. I guess you said take it off the table, but that is the fundamental difference. — Yeah. — Yeah. And maybe one other difference: we talked much earlier about a single latent space and a prediction that's being made at the top of the transformer, where you compute the loss function and that's back-propagated through the transformer. That's not how neurons learn. Every neuron is actually making predictions, and every neuron is getting its input and learning independent of anything that happens at the top. And so it's a much more granular learning signal. Information does flow from the top to the bottom, but there are also many other sources of information that it's learning from. So it's different in that sense as well, mechanistically. — The reason I ask, and now I'd like to get into, you know, some of the fun speculation, because it's been a phenomenal discussion with the two of you. I think we've kind of elucidated the differences. Something I've wondered after I've talked to both of you, and you know, Nicolo, kind of learning about this compression view of the world, lossless compression, and Subutai, just, you know, the Thousand Brains Theory and these cortical columns and the sampling of the world to capture the signal that you can learn from. So let's say that I was able to design a really small, efficient digital cortical column. Maybe it's transformer-based with a sparse representation and some sensory-motor mechanism built in. Maybe it's more dendritic-based, you know, mapped into digital hardware.
And I put a cortical column on every sensor I have in the world, associate them with every person, and wire them up together with some of this, and then have, you know, billions of them that can form higher-level abstractions. Like, what do you think would happen? What could we do? — That's a fantastic thought exercise. I think, you know, again assuming the cortical column is faithful and can generate or suggest motor actions as well, I mean, in some sense you could potentially have a superintelligent system, right, that's far more intelligent than anything else on the planet. Now we're scaling the number of cortical columns not from a mouse, you know, to the 100,000 columns that a human might have, but potentially to billions of cortical columns and way more, and there's no reason to think there's any fundamental limit there. So this sort of system is, I think, the way that superintelligent systems will eventually be built. — But this is a very different direction — than we're currently headed down, with these monolithic models where we're doing tons of RL, you know, to get high-value human collaboration in distribution. — Yes, it's completely different from the direction we're proceeding in. So I think that to go down that path there needs to be a fundamental rethinking of some of our assumptions, potentially even down to the hardware architectures that are necessary to implement it, the fundamental learning algorithms, the fundamental training paradigm. We talked about, you know, you can't have a static data set, you're constantly moving around in the world and doing things. So it's a very different way of going about AI than what we're doing today. — Sounds like a great time to be an AI researcher. — Absolutely. — Nicolo, what was your reaction to that hypothesis? Like — it sounds super interesting.
I mean, my brain was churning. You know, my background is very different, and so I'm in a much worse position to answer this question, but I was starting to think, okay, so let's say I do this. What would be my loss function? How would information flow through this system? It sounds like cortical columns would each have their own loss that I would then aggregate, and then I would add a contribution that is higher level. And then back to my question, you know, how is the temporal information coordinated? Because, you know, the way I'm coming to understand this is that it's kind of like a multi-view framework: you have the same phenomena represented through multiple independent but at the same
Segment 13 (60:00 - 65:00)
time views, and so part of me feels like you need to tie together these cortical columns in such a way that they all get that gradient feedback, if you're training with gradient-based methods, for instance. And so it feels super interesting. It is related, very superficially, to a lot of discussion in machine learning around, hey, is it better to have one giant super-deep network or a bunch of shallow networks? But the difference is also in the way you train them, right? We typically train these bunches of shallow networks on kind of the same objective and the same data, and not typically in an experiential cycle, whereas it sounds like this is a different way to do it. — I think I want to pull this back around to the title of the podcast. And so I'll share an observation. You know, I've been using some of the latest models to code. They're getting better really fast. I've been using them to kind of relearn some of the physics that I've never really understood deeply. You know, special and general relativity, like E = mc², like why is c in there at all, right? Just stuff like that. Because now it can actually explain it to me, and I can keep beating at it until I understand it, and then of course work. And at some point I asked the model, can you describe how I think? I was just curious, and, you know, it gave me a page-long description, and my jaw dropped, because I said, this thing knows me better than I know myself. I don't think any human being, including me, could have captured the way my approach to learning works in my brain, and I just read it like, yep, that's right, and I learned something about myself. So I wouldn't say that it passed the Turing test, because this was way beyond the Turing test. This was like, this thing knows me way better, you know, than I thought any machine ever could. I mean, I'm having a conversation where it could be human, but it's superhuman.
So in some sense it's intelligent beyond human capabilities, with its ability to discern patterns in how someone's interacting, and yet it's a tool. You know, it's not conscious. It doesn't have agency, embodiment, emotion. It understands a lot of that stuff from the training data. But at the end of the day it's a stochastic parrot, right? It's got the weights, and I give it a token and it outputs a token. So, like, are these machines intelligent or not? — I'll let Subutai answer first. — Okay. You know, it's definitely a savant, right? It knows a huge amount about the world. It's absorbed a lot of stuff, and it can articulate that in ways that are just amazing, and, you know, it's taken your chat history, presumably thousands of chats, and it's able to summarize that in a way that's remarkable. At the same time, I think transformers are not intelligent in the way that a three-year-old is, right? A three-year-old human is very curious, is constantly learning. It can learn almost anything. And, you know, a three-year-old Einstein was able to learn and eventually come up with theories that shook the world, you know, E = mc². And so, could a transformer do that? I don't think so. And so I think there's still a difference. There are things they can do that are amazing, but there are still basic things that a child can do that transformers cannot do. So I think there's still a gap there. Exactly how to articulate it and how to bridge that gap is of course the trillion-dollar question, but it is bridgeable, and there is a gap today. — Right.
Nicolo? — You know, I think from my perspective they are intelligent, and from my perspective I go back to the definition of intelligence, which is: can you achieve your objectives in a variety of environments? It's very basic and fundamental, but it can be a form of embodied intelligence and agentic intelligence. If I plop you in an environment and I give you an objective, can you achieve it? And the wilder the environment, the harder the task is. And I do agree that there is a jaggedness of intelligence that we keep describing, like these things can be simultaneously super good, you know, Olympiad-level mathematicians, and still give you stupid answers when you're trying to, I don't know, figure out which cable goes where on your car's battery, you know, whatever — well, then it's better than me. I'm not an Olympiad-level mathematician and I do stupid stuff all the time. — I know, exactly. Well, you know
Segment 14 (65:00 - 68:00)
whatever. That was a bad example, but you get it. But part of me goes back to the compression view. Like, I do believe that intelligence is compression. So the ability to come up with succinct explanations for complex phenomena and even worlds implies or leads to your ability to operate within them. And the fact that these things can prove crazy theorems but at the same time fail at fairly rudimentary tasks is a sign that, yes, transformers are great in terms of the inductive biases they put on the world and computationally are great, but we're ultimately all subject to the no free lunch theorem. You know, across the set of tasks that you could be pursuing, you have certain inductive biases that kind of privilege certain tasks at the expense of others, and there isn't a thing yet that has expanded our set of tasks that are addressable. And so I do think that it's a matter of rethinking our approach to a few things, I think likely both on the architectural front and on the losses and the way we train these systems. I think there is an opportunity to expand the intelligence frontier of these models. But yeah, from my perspective, they are intelligent already, just in a jagged way. — It's such an interesting question, and I know a lot of people write a lot about this. So I don't think we're treading any new ground here. But you know, there's the diversity of the tasks you can excel at. You know, are you able to handle nuance and understand things deeply? Are you able to learn continuously? Right now, the systems can't, right? Are you embodied? I don't know if that matters. Do you have an objective? Well, we could give them one. Are you conscious? I mean, that's a whole other thing.
So it just feels like there's a bunch of check boxes, and we've checked a bunch of them and some are unchecked, and maybe there's no consensus on where that threshold is, because there are many dimensions of intelligence, some of which humans don't even have — well, and that's why we have the terms AGI and ASI, and people are debating the G and the S, what is general, what is specialized. So it's a huge discourse, for sure, but that's why we had to start characterizing. But if you go back, you know, going back to my schooling, go back to the definition of intelligence from Plato and Aristotle and Descartes, in some sense you see the goalposts moving through the centuries around what we define as intelligent, and I feel like we are still doing it — yeah, we'll be doing it for a long time, you know, which in AI velocity is probably another four or five years. Hey, I just want to thank you both for the dialogue. You know, I treasure both of you as intellects and scholars and friends. It was just a joy to nerd out with you all. So, thank you both for taking the time. — Thank you so much, Doug, for having me. — Thank you for having us. This is great.