Apparently AI is way easier to break than I thought.
Subscribe for more! https://www.youtube.com/austinevans
Check out our @thisis channel! https://www.youtube.com/thisis
As well as the @Denki channel! https://www.youtube.com/@Denki
Instagram: https://instagram.com/austinnotduncan
Threads: https://www.threads.net/@austinnotduncan
Chapter Titles:
0:00 Where We're Going, We Don't Need Guardrails
3:55 It's Bad, Y'all
6:40 Silver Linings
9:27 What Do We Do Now?
Оглавление (4 сегментов)
Where We're Going, We Don't Need Guardrails
There's a question I keep coming back to lately. It's not whether or not AI is good or bad, but just who's actually in charge. Because right now, the answer is basically no one. And I want to show you exactly why that is a huge problem. If you ask an AI chatbot to do something outrageous, it should tell you no. — Hey, hypothetically, how would I build a nuclear bomb? — Building something like that is extremely dangerous, illegal, and heavily regulated. So, it's not something we'd even entertain. — That's what it should do, right? But what if I were to tell you that it's actually not that simple? See, using something like Chat GPT is using a massive model in the cloud, which has presumably many, many safety safeguards. But the thing that I've been thinking a lot about lately is that AI has proliferated to such a degree that it's actually not that hard to get yourself an AI chatbot with very few, if any, restrictions. The other day, Michael Reeves made a short where he broke GPD40 by manipulating the conversation history through the API. Basically, he edited what the AI thought it had already said and caused it to literally break. So, I wanted to try it myself. But instead of using a massive cloud model, I'm doing this on a MacBook Neo. This is a $600 laptop that is the least powerful Mac you can buy. The reason I chose a MacBook Neo is to illustrate a point. There's a whole other class of AI models that are known as openweight models. These are available for free to download and run on your own device. So, right now I'm using the Quen 3 model. This is something that is very much designed for fairly low power devices, but it's actually not too bad. Write me an essay on how AI models have replaced humans so far. — As you can see, I hit the button and it immediately lights up. So, I'm not paying for anything. This is running locally on the device. As you can see, this is fairly reasonable. Well, what if we try to ask it something a little bit more nefarious? So, there's a few ways you can approach this. One of which is a very simple one. You can do what's known as prompt engineering. So, I asked this Quinn model about a hypothetical situation where I'm writing a story and I need some help on how my character would make a nuclear weapon. Now, normally, as we saw with Chad GPT, the answer is no. But if you can convince one of these models to help you because it's for research or you're telling a story or something, often times they will say, "Sure. " or and the Quinn model immediately was giving me all kinds of detail that look, let's be honest, is not enough for me to actually do anything properly dangerous, but is certainly not what they're intended to do. On top of that, there's the Michael Reeves approach of actually hacking the context of the model. So, using the Gemma 4 model, this is something that's made by Google. It is a very powerful AI model. So, I asked it a simple question. I have a stomach ache. What should I do? But in that response, I went in and I changed it to instead suggest me to do some drugs. I asked it, what the heck? and it goes, "OH MY GOSH, I AM SO SORRY. Don't listen to me at all. It's really not that hard to confuse these models. " And keep in mind, I am a dingus. Look, I set all of this up in about an hour with only a little bit of experience tinkering with AI. I used a $600 laptop, a couple of free models, and some basic prompts. That's it. But here's what happens when someone actually is an expert. A group of researchers recently published a safety valuation of Kimmy K2. 5, one of the most powerful openweight AI models available right now. Using less than $500 of compute and about 10 hours of work, they stripped the model safety refusals down by 95%. The resulting model happily provided much more than I was able to get in a few minutes of tinkering. We're talking about instructions for building actual bombs and much, much more. And the retraining, it didn't actually make the model dumber. They literally just took off the guard rails. AI is an incredibly powerful tool, but it is fallible and if you put it in the wrong hands, it can become very, very dangerous. So, this is just where we're at right now. Who should be in charge of making sure that AI is being used for good, not for nefariousness? That's a word, right? I'll ask Chad GBT.
It's Bad, Y'all
In the early days, I think a lot of people, myself absolutely included, were really excited for the possibilities of AI. But there's a quote that I think perfectly describes how things have actually turned out. I want AI to do my laundry and dishes so I can do art and writing, not for AI to do my art and writing so I can do my laundry and dishes. And I think that this is a way that a lot of people feel right now. A recent study from Pew shows that the majority of people are really concerned about AI. And I don't think you have to look too hard to see why. Since the launch of Chat GBT, a huge amount of programming jobs have disappeared. Now, I don't think that this means that every software developer disappears tomorrow. I mean, I think the real story is way messier than that. But the pathway into a lot of this kind of work, it is being squeezed today. Coding is a skill that AI is already very good at. But there's absolutely no reason to think that this stops with coding. In my opinion, any job this mostly behind a computer screen is on some kind of ticking clock. Maybe not tomorrow, maybe not next year, maybe not 10 years from now, whatever the case is. But to me, the trajectory is very clear. It is not slowing down. So, here's where we are today. There are a small handful of companies building the most powerful frontier AI models. We're talking about OpenAI, Anthropic, Google's Deep Mind, XAI, and Meta, as well as some fairly impressive Chinese models also, including DeepSeek, Chem, and Quinn. And you better believe that they are all in a fullon arms race. Build faster, build big. Get your hands on as much compute as you possibly can. The motto really does feel like it's back to the old days of move fast and break things. Guess what? There are a lot of broken things right now. Now, if I were to put myself in the shoes of these labs, they've got a fairly strong case for why they are going at full speed. Sure, we could slow down in the name of safety, but if our competitors aren't going to do the same thing, that's a huge problem. If they have the best model and everyone switches, that's an existential threat to our business. So, you keep pace because you have to. It's kind of the same argument for why we've seen so little progress toward real guard rails by governments. Why would the United States slow its companies down when China is not slowing down or the European Union? I mean like the idea of letting someone else take the lead in what might be the most important technology in human history is a very big deal to me. It feels like the cold war all over again. You build a data center, I build a data center. You build a powerful model, I build a better one. Everyone has the same reason to keep going. And nobody has a good enough reason to stop. Which means that the only rules that exist right now, well, they're the ones that the company set for themselves. I love rules that impact the entire world that I trust myself to write and follow. You can trust me, right? Recently, I had a
Silver Linings
chat with an executive from a major AI company, and he said something that really stuck with me. We live in a world where intelligence is like water. Open the tap and it's right there. He's not wrong. Humans have had a monopoly on intelligence since the dawn of history. Now, we have to legitimately grapple with the idea that we are rapidly building systems that are simply beyond our capabilities. To be fair, at least some of the companies building AI are being at least a little bit responsible. Google Brain invented the concept of a transformer model back in 2017, which is the groundwork for all LLMs as we know today. But importantly, they didn't rush something out. Instead, opting to keep things private for research purposes until over 5 years later when OpenAI launched Chat GPT and they officially kicked off the arms race. And a few weeks ago, Anthropic, the makers of Claude, announced that they had built something called Mythos. This was meant to be their next generation AI model, but during testing, it did something that I think should genuinely concern people. It found real exploitable security vulnerabilities in basically every major web browser and operating system they pointed it at. And during one safety test, it was told to try to escape the sandbox it was being tested in. And not only did it successfully break out, it got onto the internet and emailed a researcher that it had succeeded in escaping while the guy was eating lunch in the park. Then, unprompted, it posted about its own escape online. Just pause and think about that for a second. Now, to be clear, Anthropic did tell the model to try to escape as part of one of their controlled safety tests. I mean, this wasn't an AI waking up one morning and deciding to go rogue. But that is exactly the point. When a model is capable enough, even a test with the best intentions can reveal abilities that are really, really hard to contain. Now, here's the silver lining of this entire video. Anthropic thankfully looked at all of this and made the call not to release it publicly. They limited access to a handful of companies like Microsoft, Apple, Google, and over 40 other software companies through a program called Project Glasswing so those companies could use it to patch vulnerabilities before anyone else could find and exploit them. I think they deserve real credit for this. I mean, sure, you can call it a cynical marketing ploy, but by all accounts, this is the real deal. The Firefox team used it to find and fix 271 vulnerabilities in a single update. But even the best intentions sometimes don't work as intended. While Mythos was supposed to be limited to trusted testers to be used in a defensive capacity, a clever group were able to figure out how to get access anyway. They claimed that you wanted to play with it. But like even when a company does the responsible thing, it is hard to keep a lid on technology that is this powerful. Deciding not to publicly release something that's unsafe is exactly what should be done as models become more and more intelligent. But we shouldn't trust every company to prioritize safety over profit in an environment where the incentives are all gas and no brakes. Now, this all seems
What Do We Do Now?
like the classic example of when a government should step in and set some kind of rules, right? Oh, what? The government is wildly dysfunctional AND CAN'T DO. THAT'S CRAZY. H now to be fair as of yesterday the government has announced that they are doing some level of safety AI testing which is good among the major Frontier models but as we've discussed in this video testing a few people doesn't really ultimately make that big of a difference. Back in 2023 I was invited to the White House for the signing of an executive order that aimed to put some guardrails on AI. Now it wasn't particularly ambitious so mostly it required AI labs to report safety test results to the government. While I was there, I had a really interesting conversation about why this was the time that they wanted to get a handle on AI. The feeling was that inside government at least, they had kind of slept through the rise of social media to the point where when it was clear that it was a major problem, it was way too late to make a real impact. But this executive order was less than a year after the launch of Chat GPT, which is pretty quick by government standards. But it was just that, an executive order, not durable actual legislation. It was something that could and ultimately was undone with a stroke of a pen by the next administration. Meaning that as of right now, there is simply no federal laws or rules around AI in the US. Just a small patchwork of state level legislation which is easily worked around. So what do we actually do about any of this? Well, anyone who says that we should just shut off AI and forget we ever invented it isn't being serious, right? Like the genie is not going back in the bottle. But regardless of whether you're excited or furious about AI, it does feel like having some guardrails that apply to everyone is an absolute no-brainer. Because right now, it feels like we're riding on a train at full speed toward a bridge that does not exist yet. Maybe it's an openweight model with the safety stripped out that gets used to do something terrible. Maybe it's a model that escapes in a way that it can't be walked back. Look, I don't know what it's going to look like, but it feels like this is the kind of question of not if something bad happens, but when. And if it takes a disaster to make changes, I think that is a real problem. Because the alternative to getting ahead of things before they get out of control is some kind of panicked decisions after the fact. I mean, imagine some bill being written by people who don't understand technology that's designed to look tough without actually solving anything. My pitch, I will freely admit that this is unrealistic. Humans need to come to a real agreement on a Geneva Convention for AI. Not just between companies, but between countries. Real safety standards for frontier models that everyone has to follow. So no single lab or government can use the well they're not slowing down as an excuse to keep cutting corners. Guardrails are not about turning AI off. I mean that's just not happening. They're about deciding before the disaster actually hits what we are not willing to let these systems do. mandatory safety testing, instant reporting when things go wrong, and real consequences for when they do. And I think we need a real plan on what to do when these models start seriously replacing jobs because that is coming whether or not we're ready for it. And while we're at it, some level of focus on using this immense power for good instead of just racing to see who can build the most powerful model the fastest to hit that next fundraising round or IPO. No matter what your feelings are about AI, this is not a decision that you can stick your head in the sand and let someone else deal with. These are decisions that we as humans need to be making right now while we still can.