Well, I've been using Claude Code for quite a while, and yes, I have been playing about with the new Claude Co-work. And for me, those predictions are just not true. But so many of us might then throw the baby out with the bathwater and miss out on some pretty crazy productivity gains. So I'm going to show why we shouldn't underestimate the gains to be had either. Then, for those who want to go a bit deeper, I'm going to end with the why. Why can models produce genius, like seeing tiny bugs in large codebases and writing, for me, powerful poems, but also still fail at such basic tasks? No, I don't mean how many A's are in the word orange, although surprisingly GPT-5.2 still can't get that right. No, I mean why are they still sometimes so brittle, memorizing that Tom Smith's wife is Mary Stone but not deducing that Mary Stone's husband is Tom Smith? And what does any of this mean for your job, white collar or otherwise? What does the latest data show? First, of course, a quick word on Claude Co-work, which inevitably, it seems, some people are calling AGI. This follows numerous viral posts and articles claiming that the underlying model, Claude Opus 4.5, given the right scaffold, is already AGI. Indeed, a long list of notable commentators have this perspective. These posts can lead, of course, to two very disparate reactions, both of which I'd advise against. One, that it's all BS, all hype merchants, that these tools hallucinate all the time and are pretty much useless. And second, that they are AGI, perhaps, and you are just missing out: we can't understand how to use them, we're missing out on so much, our careers are doomed. This video is hopefully going to channel you down the middle path, which is: you can get great productivity gains, but they're not there yet. For context, I've been using Claude Code for a very long time and Co-work for the last 48 hours. So, to slightly debunk the hype point:
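As an aside, that letter-counting failure is less mysterious than it looks: models read subword tokens rather than individual characters, while a couple of lines of ordinary code settle the question deterministically. A minimal sketch (the function name is mine, purely illustrative):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("orange", "a"))  # prints 1: there is one 'a' in "orange"
```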
If I gave a new employee this task: make a comparison chart of this football club's league position on this date for each of the last five seasons, add it as a PowerPoint to my desktop, and, oh, ask any clarifying questions and share a plan of how you'll approach the task. I would expect, and let me know if you agree or not, for them to either say at the end of the day, "I couldn't find any source to give definitive answers on that question," or to have produced the relevant PowerPoint. Now, you can see the Co-work tab here and the kind of questions it lays out, and it does indeed give a great plan. I approved it immediately, and it didn't even take that long, to be honest. The result, I would say, was visually quite impressive and pretty much acceptable. Obviously, you have to pick a moderately hard task, because if it's too easy, you'd just do it yourself. So, this was the result. Slight problem: I checked two of the dates it gave me, for January 2023 and 2025, and the league positions it gave for this club, and both were incorrect. Within about five minutes of manual checking, I found two other data sources, the BBC and the site 11v11, both of which said that Stockport were seventh, not third, on January 13th, 2025. This co-working AGI, by the way, did not caveat its results in its summary to me, nor say that it couldn't find a reliable source. Now, I could of course give you hundreds of such examples from the legendary Claude Code powered by Claude Opus 4.5, but that wouldn't be too interesting, or fair on you, because you'd have to see the whole context of the codebase. I just don't want you guys to walk away from these viral posts thinking: unless I spend all my money keeping up with a tool released just last week, I'm going to completely fail at my white-collar job, and if the models make any mistakes, I'm the dumb one, I must have done something wrong.
But I don't want you to make the opposite mistake, which is to completely ignore these tools and think that they can't boost your productivity at all. The truth lies somewhere in the middle. And look, even the lead developer for Claude Code said as much in a later reply, after saying all of the code for Claude Co-work was written by Claude Opus 4.5. He clarified: "It was not zero intervention. We the humans had to plan, design, and go back and forth with Claude." Which then, for my super smart audience, leads to a key question. Is it faster to get Claude Code to do the draft, then redraft, then test, fail, redraft, and eventually get it right, or for the human to just do it themselves from scratch, whether it be coding or other white-collar work? Thankfully, we have a key clue from an OpenAI paper from October of 2025. Using blind human grading, we have already passed that tipping point: we get more of a productivity multiplier by getting models to try again and again, with the human stepping in only to review and edit, than from the human doing it all themselves. This GDPval paper covers dozens of white-collar industries, and I did an entire video on it, so I'm not going to go into too much depth, but that for me is the real tipping point. And yes, I've experienced that in my own coding, which I do almost every day. It makes a bunch of dumb and sometimes dangerous mistakes, but don't throw the baby out with the bathwater. Even take my Stockport PowerPoint. It's really quite well designed, and almost all the other facts are true, so I could just edit a couple of the numbers and have a decent presentation in less time than creating it myself from scratch. Quick bit of technical detail: Claude Co-work is only available on the Max tier, minimum $90 or $100 a month, and on macOS only. That's macOS, not Windows. And Max only, not the Pro tier of Claude. Notice this productivity speed-up though is only true for a
LLMs can seem so brittle: navigating incredibly complex codebases to pick out a minuscule bug, but then sometimes, as with Claude Co-work, going along merrily and deleting 11 GB of files randomly from a guy's desktop, according to one user from two days ago. Why do they do that? Well, in short, because there are multiple levels of, quote, understanding in large language models. First though, I'm going to give you a freaky thought. We don't even know what the word "understanding" means in English. We know what it denotes, but what are we under? If the "under" prefix isn't the usual meaning of beneath, is it like the "under" of "undergo" or "under the circumstances"? The best guess for the etymology of the word "understand" seems to be standing between or among things: being in the presence of, or connected to, something rather than distant from it. Again though, it seems like early humans didn't fully grok, or understand, what understanding meant, like being in the presence of something. And even a synonym like "comprehend" essentially means to grasp something. But why would holding something, or grasping it, mean you get it logically, intellectually? And then the etymology of the word "intelligence" is to pick between things. So it's no wonder that, with this cloud of notions about standing in the presence of something, picking between things, having a grasp on things, and no fully intuitive definition of understanding, we would struggle to ascribe understanding to LLMs. In this paper from Beckman and Quaos, they give three categories of understanding. First, simple conceptual understanding: just registering that there are connections between diverse manifestations of an entity. That's it, just finding connections between two things. Then the second stage, state-of-the-world or contingent understanding: these things being true or connected only in certain circumstances, at certain times. Then the ultimate, what I've described in other videos as efficiently deriving new functions.
That's principled understanding: the ability to grasp the underlying principles or rules that unify a diverse array of facts. If you don't have much time, the TL;DR from this paper is that LLMs possess understanding distributed across a motley mix of mechanisms spanning all three tiers. They don't, in a sense, aspire to simplicity or parsimony; they just learn whatever connection, brittle or deeply algorithmic, will get the job done. They can reach that third stage of understanding, deriving deep algorithms and patterns from the world. They can grok how to do addition and thereby discard the memorized pairs of what this plus this adds up to, and they plan ahead with poems: on the token before a new line of a poem starts, there is a circuit within Claude already planning what the rhyme will be and the semantics needed to achieve that rhyme. Researchers have found computable circuits for numerical comparison, multiple-choice question answering, and even, as I discussed in the autumn of last year, circuits for recognizing that introspection is called for. Given that these circuits are well-defined and reusable, who are we to say that they haven't understood the concept? But here's the thing. LLMs also rely on brittle memorization. They pragmatically toggle between modeling the state of the world and relying on shallow heuristics, or rules of thumb, depending on which circuit minimizes loss, that is, makes their predictions better most efficiently. They're kind of like a lazy bright kid who sometimes forces themselves to properly learn the material and other times just memorizes what they need. The fact that they sometimes use memorization, though, does, as the authors note, undermine the basis for epistemic trust. When they got something right, did they rely on that unifying mechanism or merely on a swarm of shallow heuristics?
Of course, cognitive psychology also points to the fact that humans do the same, sometimes relying on shortcuts, saying or doing the first thing that comes to mind, on a local or international stage. Other humans try to double-check those heuristics and think deeply about problems. So when you speak to an LLM, the authors note, it's a bit like speaking to a gigantic committee of drastically varying expertise. Higher-quality circuits are sometimes reinforced, but sometimes also drowned out by lower-quality circuits. Remember, these are alien intelligences doing whatever they can, the easy way or the hard way, to predict the next word or token. To a human, the sentence "Tom's wife is Mary" is an embodied concept. It has dozens and dozens of connotations, not least that Mary's husband is Tom. For an LLM, the first time it sees "Tom Smith's wife is Mary," that just updates its weights for predicting what comes after, in future, "Tom Smith's wife is," or maybe permutations like "the wife of Tom Smith is." It hasn't bound those concepts, though, so it has no reason to believe that the sentence "Mary Stone's husband is" will end with "Tom." Now, as various other papers discuss, this particular weakness can be solved through data augmentation, but that's not my point. My point is that LLMs can understand things at a very deep level and, simultaneously, at a very shallow level. There is mixed evidence that reinforcement learning can strengthen those higher circuits, if you will. But this and other papers show that once an LLM has learned enough to get the question right most of the time, it has, with current methods, much less incentive to learn even higher circuits to get it right even more often. We are, though, exploring an alien landscape. There could well be a breakthrough a month from now, two months from now, wherein we incentivize models to reach much higher planes of understanding.
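That data-augmentation fix is, in rough terms, just making sure both directions of a relation appear in the training text. Here is a hypothetical sketch of the idea; the names, the relation table, and the helper function are mine, purely for illustration, not taken from any paper mentioned in the video:

```python
# Hypothetical sketch: augmenting relational facts with their reversals,
# so a model sees both directions of a symmetric relation during training.

FACTS = [("Tom Smith", "wife", "Mary Stone")]

# Inverse relation for each symmetric pair (assumed mapping, for illustration).
INVERSE = {"wife": "husband", "husband": "wife"}

def augment(facts):
    """Emit each (subject, relation, object) fact as text in both directions."""
    examples = []
    for subj, rel, obj in facts:
        examples.append(f"{subj}'s {rel} is {obj}.")
        examples.append(f"{obj}'s {INVERSE[rel]} is {subj}.")
    return examples

for line in augment(FACTS):
    print(line)
# Produces both "Tom Smith's wife is Mary Stone."
# and "Mary Stone's husband is Tom Smith."
```

The point of the sketch is only that the reversed sentence never has to be deduced at inference time; it is simply placed into the training distribution.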
For this paper, that could be achieved by encouraging models to reach that state of almost-confusion; that's when multiple avenues can be explored most productively. And what levels of understanding could they reach if they were trained on a diverse range of new modalities? The US government is giving AI labs access to a dozen national laboratories. And that's before we even get to hybrid architectures that have proven their worth with, for example, weather forecasting. Anyway, this video is getting too long. The point is to leave you somewhere between those two extremes. You're not alone if AI models constantly make mistakes on your workflow. Nor, though, would it be fair to say that they're all hype. For me, maximal understanding of them, and productivity using them, comes from that place in the middle. Thank you so much for watching and have a wonderful