Well, I've been using Claude Code for quite a while, and yes, I have been playing about with the new Claude Co-work. And for me, those predictions are just not true. But so many of us might then throw the baby out with the bathwater and miss out on some pretty crazy productivity gains. So I'm going to show why we shouldn't underestimate the gains to be had either. Then, for those who want to go a bit deeper, I'm going to end with the why. Why can models produce genius, like seeing tiny bugs in large codebases and writing, for me, powerful poems, but also still fail at such basic tasks? No, I don't mean how many A's are in the word orange, although surprisingly GPT-5.2 still can't get that right. No, I mean why are they still sometimes so brittle, memorizing that Tom Smith's wife is Mary Stone but not deducing that Mary Stone's husband is Tom Smith? And what does any of this mean for your job, white collar or otherwise? What does the latest data show? First, of course, a quick word on Claude Co-work, which inevitably, it seems, some people are calling AGI. This follows numerous viral posts and articles claiming that the underlying model, Claude Opus 4.5, given the right scaffold, is already AGI. Indeed, a long list of notable commentators have this perspective. These posts can lead, of course, to two very disparate reactions, both of which I'd advise against. One, that it's all BS, all hype merchants, that these tools hallucinate all the time and are pretty much useless. And second, that they are AGI, perhaps, and you are just missing out: we can't understand how to use them, we're missing out on so much, our careers are doomed. This video is hopefully going to channel you down the middle path, which is: you can get great productivity gains, but they're not there yet. For context, I've been using Claude Code for a very long time and Co-work for the last 48 hours. So, to slightly debunk the hype point:
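As an aside, that letter-counting failure is less mysterious than it looks: models read subword tokens rather than individual characters, while a couple of lines of ordinary code settle the question deterministically. A minimal sketch (the function name is mine, purely illustrative):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("orange", "a"))  # prints 1: there is one 'a' in "orange"
```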
If I gave a new employee this task: make a comparison chart of this football club's league position on this date for each of the last five seasons, add it as a PowerPoint to my desktop, and, oh, ask any clarifying questions and share a plan of how you'll approach the task. I would expect, and let me know if you agree or not, for them to either say at the end of the day, "I couldn't find any source to give definitive answers on that question," or to have produced the relevant PowerPoint. Now, you can see the Co-work tab here and the kind of questions it lays out, and it does indeed give a great plan. I approved it immediately, and it didn't even take that long, to be honest. The result, I would say, was visually quite impressive and pretty much acceptable. Obviously, you have to pick a moderately hard task, because if it's too easy, you'd just do it yourself. So, this was the result. Slight problem: I checked two of the dates it gave me, for January 2023 and 2025, and the league positions it gave for this club, and both were incorrect. Within about five minutes of manual checking, I found two other data sources, the BBC and the site 11v11, both of which said that Stockport were seventh, not third, on January 13th, 2025. This co-working AGI, by the way, did not caveat its results in its summary to me, nor say that it couldn't find a reliable source. Now, I could of course give you hundreds of such examples from the legendary Claude Code powered by Claude Opus 4.5, but that wouldn't be too interesting, or fair on you, because you'd have to see the whole context of the codebase. I just don't want you guys to walk away from these viral posts thinking: unless I spend all my money keeping up with a tool released just last week, I'm going to completely fail at my white-collar job, and if the models make any mistakes, I'm the dumb one, I must have done something wrong.
But I don't want you to make the opposite mistake, which is to completely ignore these tools and think that they can't boost your productivity at all. The truth lies somewhere in the middle. And look, even the lead developer for Claude Code said as much in a later reply, after saying all of the code for Claude Co-work was written by Claude Opus 4.5. He clarified: "It was not zero intervention. We the humans had to plan, design, and go back and forth with Claude." Which then, for my super smart audience, leads to a key question. Is it faster to get Claude Code to do the draft, then redraft, then test, fail, redraft, and eventually get it right, or for the human to just do it themselves from scratch, whether it be coding or other white-collar work? Thankfully, we have a key clue from an OpenAI paper from October of 2025. Using blind human grading, we have already passed that tipping point: we get more of a productivity multiplier by getting models to try again and again, with the human stepping in only to review and edit, than from the human doing it all themselves. This GDPval paper covers dozens of white-collar industries, and I did an entire video on it, so I'm not going to go into too much depth, but that for me is the real tipping point. And yes, I've experienced that in my own coding, which I do almost every day. It makes a bunch of dumb and sometimes dangerous mistakes, but don't throw the baby out with the bathwater. Even take my Stockport PowerPoint. It's really quite well designed, and almost all the other facts are true, so I could just edit a couple of the numbers and have a decent presentation in less time than creating it myself from scratch. Quick bit of technical detail: Claude Co-work is only available on the Max tier, minimum $90 or $100 a month, and on macOS only. That's macOS, not Windows. And Max only, not the Pro tier of Claude. Notice this productivity speed-up though is only true for a
LLMs can seem so brittle: navigating incredibly complex codebases to pick out a minuscule bug, but then sometimes, as with Claude Co-work, going along merrily and deleting 11 GB of files randomly from a guy's desktop, according to one user from two days ago. Why do they do that? Well, in short, because there are multiple levels of, quote, understanding in large language models. First though, I'm going to give you a freaky thought. We don't even know what the word "understanding" means in English. We know what it denotes, but what are we under? If the "under" prefix isn't the usual meaning of beneath, is it like the "under" of "undergo" or "under the circumstances"? The best guess for the etymology of the word "understand" seems to be standing between or among things: being in the presence of, or connected to, something rather than distant from it. Again though, it seems like early humans didn't fully grok, or understand, what understanding meant, like being in the presence of something. And even a synonym like "comprehend" essentially means to grasp something. But why would holding something, or grasping it, mean you get it logically, intellectually? And then the etymology of the word "intelligence" is to pick between things. So it's no wonder that, with this cloud of notions about standing in the presence of something, picking between things, having a grasp on things, and no fully intuitive definition of understanding, we would struggle to ascribe understanding to LLMs. In this paper from Beckman and Quaos, they give three categories of understanding. First, simple conceptual understanding: just registering that there are connections between diverse manifestations of an entity. That's it, just finding connections between two things. Then the second stage, state-of-the-world or contingent understanding: these things being true or connected only in certain circumstances, at certain times. Then the ultimate, what I've described in other videos as efficiently deriving new functions.
That's principled understanding: the ability to grasp the underlying principles or rules that unify a diverse array of facts. If you don't have much time, the TL;DR from this paper is that LLMs possess understanding distributed across a motley mix of mechanisms spanning all three tiers. They don't, in a sense, aspire to simplicity or parsimony; they just learn whatever connection, brittle or deeply algorithmic, will get the job done. They can reach that third stage of understanding, deriving deep algorithms and patterns from the world. They can grok how to do addition and thereby discard the memorized pairs of what this plus this adds up to, and they plan ahead with poems: on the token before a new line of a poem starts, there is a circuit within Claude already planning what the rhyme will be and the semantics needed to achieve that rhyme. Researchers have found computable circuits for numerical comparison, multiple-choice question answering, and even, as I discussed in the autumn of last year, circuits for recognizing that introspection is called for. Given that these circuits are well-defined and reusable, who are we to say that they haven't understood the concept? But here's the thing. LLMs also rely on brittle memorization. They pragmatically toggle between modeling the state of the world and relying on shallow heuristics, or rules of thumb, depending on which circuit minimizes loss, that is, makes their predictions better most efficiently. They're kind of like a lazy bright kid who sometimes forces themselves to properly learn the material and other times just memorizes what they need. The fact that they sometimes use memorization, though, does, as the authors note, undermine the basis for epistemic trust. When they got something right, did they rely on that unifying mechanism or merely on a swarm of shallow heuristics?
Of course, cognitive psychology also points to the fact that humans do the same, sometimes relying on shortcuts, saying or doing the first thing that comes to mind, on a local or international stage. Other humans try to double-check those heuristics and think deeply about problems. So when you speak to an LLM, the authors note, it's a bit like speaking to a gigantic committee of drastically varying expertise. Higher-quality circuits are sometimes reinforced, but sometimes also drowned out by lower-quality circuits. Remember, these are alien intelligences doing whatever they can, the easy way or the hard way, to predict the next word or token. To a human, the sentence "Tom's wife is Mary" is an embodied concept. It has dozens and dozens of connotations, not least that Mary's husband is Tom. For an LLM, the first time it sees "Tom Smith's wife is Mary," that just updates its weights for predicting what comes after, in future, "Tom Smith's wife is," or maybe permutations like "the wife of Tom Smith is." It hasn't bound those concepts, though, so it has no reason to believe that the sentence "Mary Stone's husband is" will end with "Tom." Now, as various other papers discuss, this particular weakness can be solved through data augmentation, but that's not my point. My point is that LLMs can understand things at a very deep level and, simultaneously, at a very shallow level. There is mixed evidence that reinforcement learning can strengthen those higher circuits, if you will. But this and other papers show that once an LLM has learned enough to get the question right most of the time, it has, with current methods, much less incentive to learn even higher circuits to get it right even more often. We are, though, exploring an alien landscape. There could well be a breakthrough a month from now, two months from now, wherein we incentivize models to reach much higher planes of understanding.
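That data-augmentation fix is, in rough terms, just making sure both directions of a relation appear in the training text. Here is a hypothetical sketch of the idea; the names, the relation table, and the helper function are mine, purely for illustration, not taken from any paper mentioned in the video:

```python
# Hypothetical sketch: augmenting relational facts with their reversals,
# so a model sees both directions of a symmetric relation during training.

FACTS = [("Tom Smith", "wife", "Mary Stone")]

# Inverse relation for each symmetric pair (assumed mapping, for illustration).
INVERSE = {"wife": "husband", "husband": "wife"}

def augment(facts):
    """Emit each (subject, relation, object) fact as text in both directions."""
    examples = []
    for subj, rel, obj in facts:
        examples.append(f"{subj}'s {rel} is {obj}.")
        examples.append(f"{obj}'s {INVERSE[rel]} is {subj}.")
    return examples

for line in augment(FACTS):
    print(line)
# Produces both "Tom Smith's wife is Mary Stone."
# and "Mary Stone's husband is Tom Smith."
```

The point of the sketch is only that the reversed sentence never has to be deduced at inference time; it is simply placed into the training distribution.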
For this paper, that could be achieved by encouraging models to reach that state of almost-confusion; that's when multiple avenues can be explored most productively. And what levels of understanding could they reach if they were trained on a diverse range of new modalities? The US government is giving AI labs access to a dozen national laboratories. And that's before we even get to hybrid architectures that have proven their worth with, for example, weather forecasting. Anyway, this video is getting too long. The point is to leave you somewhere between those two extremes. You're not alone if AI models constantly make mistakes on your workflow. Nor, though, would it be fair to say that they're all hype. For me, maximal understanding of them, and productivity using them, comes from that place in the middle. Thank you so much for watching and have a wonderful