How To Win Humanity's Last Hackathon - The hardest agent contest in AI.
17:29


HuggingFace 29.04.2026 2,337 views 102 likes


Video description
Follow this org to sign up: https://hf-learn.short.gy/nvB8JD

The hardest agent contest in AI just launched. Here's how to win it. You can now sign up to Humanity's Last Hackathon. You build Mac Metal kernels. You use Codex from OpenAI to optimize them. You submit through Hugging Face. The fastest kernels qualify for the final battle.

What this video covers:
The qualification task, explained
Setting up Codex for kernel work
Benchmarking and submitting through Hugging Face
What it takes to climb the leaderboard and advance

Launch: May 15th, 2026

Contents (4 segments)

Segment 1 (00:00 - 05:00)

Hi, welcome to this video on how to win Humanity's Last Hackathon. This is a hackathon in the normal sense. There are going to be winners, a leaderboard, and you're going to need to complete a pretty hard task. However, in many ways, this hackathon is not a normal hackathon. Instead of having to write code to solve a problem, you're going to have to use agent context to solve the problem. So you're going to use Codex for free and define the context that Codex gets in order to solve an extremely hard AI systems engineering problem. In this case, we're going to focus on optimizing kernels for Mac Metal. This hackathon is sponsored by OpenAI, Hugging Face, GPU Mode, and ExecuTorch from PyTorch. The hackathon itself is going to start on the 15th of May, so just a couple of weeks away. The format will be that you first qualify through optimized kernels, so the fastest kernels qualify, and then you go to the final round where the context that you use will be tested against other kernels. If you want to join in, just come to this site, there's a link in the description, and click register. Once you click register, you'll join the organization on Hugging Face, and you'll get all the updates you need. There's also an overview here of exactly how it's going to work. I'll just give you a summary of that, and then I'm going to go into some details about how to win the contest. So there'll be three Mac Metal kernels. We're obviously keeping those very secret, so in this video, we'll use a different set of kernels. Also, in the final round, we use another set of secret kernels, so it's not something that you can prepare for. As I said, every participant is going to get free use of Codex for this, so you can really go as far as you need. Why are we optimizing kernels? What's the point? So kernels are how AI models interact with the hardware that they're using. So the better we can optimize these kernels, the better models will work on different hardware.
So if you're interested in running a model on your own machine, especially if it's Mac Metal, then this is a perfect competition for you. However, the skills will transfer across other hardware and other models. What's the prize? The prize is a free year of ChatGPT Plus and Hugging Face Pro, so you can go pretty far with that, and you can build even more tools after the contest. Scoring. The first round will be scored on the speed of the kernels, and the second on your context in a held-out kernel setting. How can you stay up to date? Join the Discord channel or follow us on the Hugging Face Hub organization. Okay, so let's just check out some slides first and see what we're talking about when we refer to kernels. Okay, so kernels are, as I said, the way that a machine learning framework, a machine learning model, will interact with the hardware. For a long time, these were seen as simply too challenging for most agents. However, that's not really the case anymore. And actually, many people are using agents to write kernels, and they're publishing them on the Hub, and it's a growing ecosystem, just like models and datasets, but now for kernels. Let's give you just a quick idea of what it means to optimize a kernel and how you'll do that. To win this contest, you're going to have to do quite a bit of research and dig further into this and understand the hardware that you're using, but I'm going to give you an overview that will help you start off. So what is efficiency in deep learning? There are three main parts of efficiency. There's the compute, the memory, and the overhead. Compute is the time spent doing operations, multiplications, and these kinds of things. Memory is the time spent moving tensors or data around different types of memory. And overhead is everything else. Most of us might assume that compute is the main bottleneck, but in many cases, it's memory.
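To make the compute-versus-memory point concrete, here is a minimal Python sketch of a back-of-the-envelope estimate for a matmul. It is not from the video. It assumes each matrix is read or written exactly once, and the two hardware numbers (`PEAK_FLOPS`, `BANDWIDTH`) are placeholders rather than measurements of any real Apple chip.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul.

    Idealized assumption: A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def likely_bottleneck(intensity: float, peak_flops: float, bandwidth: float) -> str:
    """Compare the kernel's intensity with the machine's FLOPs-per-byte balance."""
    machine_balance = peak_flops / bandwidth
    return "compute" if intensity >= machine_balance else "memory"


# Placeholder hardware numbers, chosen only for illustration.
PEAK_FLOPS = 10e12  # floating-point operations per second
BANDWIDTH = 400e9   # bytes per second

for shape in [(1024, 1024, 1024), (1, 4096, 4096)]:
    intensity = matmul_intensity(*shape)
    verdict = likely_bottleneck(intensity, PEAK_FLOPS, BANDWIDTH)
    print(f"{shape}: {intensity:.2f} FLOP/byte, likely {verdict}-bound")
```

Under these assumptions a large square matmul does many operations per byte, while a matrix-vector shaped one does less than one operation per byte, which is the situation where memory movement dominates.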
GPUs have different types of memory that are faster and slower, and so we can optimize this by making sure that the GPU takes most advantage of the fastest type of memory. That means the machine learning framework needs to know how the hardware works in order to optimize it the best. How much of that fast memory does it have? How should I best move data between those different types of memory to get the most out of it? In many cases, there are optimized kernels out there from research and industry, and they're very effective. For example, flash attention is used widely across the ecosystem. However, they don't exist for all hardware, and as new hardware innovations are released, we need new kernels to take advantage of that hardware. If we look at more niche hardware, like Mac Metal that isn't necessarily used

Segment 2 (05:00 - 10:00)

for machine learning as much, there are even fewer kernels, and the challenge is even harder. So what is it specifically about Mac Metal that makes it so much different to all the other GPUs? Well, an NVIDIA GPU has two types of memory. It has a slower and a faster memory, and it will have the system memory as well as the GPU memory on the machine. And so the kernel will move the data between these different types of memory. On an Apple machine, there's a unified DRAM pool that's shared between system processes and the machine learning processes that you're doing. There are CPU cores and GPU cores, and they both have access to this same memory. So data doesn't need to be moved for these two different sets of cores in order to interact with these memory objects. That said, that DRAM is much slower than NVIDIA's VRAM. In practice, most operations use this unified DRAM and then have thread groups that share data across different threads. If you want to look into this further, you're going to have to research this, and this is really the core of the hackathon. This hackathon is about being a researcher and not so much an implementer. Okay, let's take another look now at how we're going to do this in practice and what it's going to look like. So I'll switch over to my terminal, and I'm going to give you a few examples here. Okay, so in the first example, I just come to Codex here. I start Codex off like normal. It's using GPT 5.5 Medium, which is a default setting. And I say, optimize the kernel for Mac Metal MPS with Torch and matmul. Use techniques from customizing a Python operation README. So this is a file that I collected from the web. Let's take a look at that. I did some basic research. I came here, and I went into the Torch documentation, and I found a couple of ways of how to optimize kernels. This here is for MLX specifically. I found this old sample from a conference a few years ago. I downloaded that, and I looked at a few different sources.
I couldn't really find anything definitive that was an example that I could copy, but I downloaded it to my repo anyway, and I went to Codex. I then gave Codex that file and said, just optimize this kernel, right? So kind of real YOLO, like I don't really know what I'm doing. I'm just going to see what Codex does. Codex inspects the README that it got, and it inspects the kernel file that it got there. It's not in a workspace, and it does some basic kind of operations. It says that the first pass is in place, and it's checking the syntax. And then it quickly says, implemented the custom Mac Metal MPS matmul path in matmul. Okay, it gives an idea of what it changed, so it's embedded this Objective-C++ PyTorch extension using Metal buffers. This doesn't necessarily sound so good from a high level, and it doesn't benchmark it either, so that's kind of problematic. So okay, I pass here a prompt to benchmark the original and optimized matmul. I want to know the difference. So it runs the benchmark as matmul kernel equals Torch for the original path. Okay, and it does that. Eventually, we get a table here. So the original, this is Torch. Torch's native implementation gives us a mean time of 1.3 milliseconds, and the custom one is 6.5, so that's pretty bad. That isn't optimized. So in this case, Codex has failed. Okay, but it did write code, and it looked good from the outside. How can we go further? How can we get more out of Codex? Well, we could prompt Codex more and give it more information. I could do more research and kind of go away and do a better job of finding examples on the web, but that doesn't really feel that agentic. So what we would do here is to use some context engineering. Let's take a look at how I've done that. So in this example here, this is the directory that I'm in. I've created a Codex file, a .codex file, and in that, I've created a number of different agents and skills.
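The original-versus-custom comparison described above can be sketched as a small timing harness. This is a hedged illustration, not the script Codex produced in the video: `benchmark` and `compare_on_mps` are hypothetical helper names, and `candidate` stands for whatever custom matmul you want to test. The GPU runs asynchronously, so the sketch calls `torch.mps.synchronize()` before stopping the clock.

```python
import statistics
import time


def benchmark(fn, warmup=5, iters=20, sync=None):
    """Time fn() and return mean and median wall time in milliseconds.

    sync is an optional callable that blocks until queued device work is done.
    """
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    if sync is not None:
        sync()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device before reading the clock
        times_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
    }


def compare_on_mps(candidate, size=1024):
    """Benchmark torch.mm against a candidate matmul on the MPS device.

    Hypothetical helper: requires PyTorch and an Apple GPU.
    """
    import torch

    if not torch.backends.mps.is_available():
        raise RuntimeError("MPS device is not available on this machine")
    a = torch.randn(size, size, device="mps")
    b = torch.randn(size, size, device="mps")
    baseline = benchmark(lambda: torch.mm(a, b), sync=torch.mps.synchronize)
    custom = benchmark(lambda: candidate(a, b), sync=torch.mps.synchronize)
    return baseline, custom
```

Calling `compare_on_mps(my_matmul)` on a Mac would return two dictionaries that can be printed side by side, in the same spirit as the table shown in the video.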
So you'll see here that firstly, I created some skills, and they have specific agents here. So in this skill.md file, let's have a look. Optimize custom PyTorch operations for Mac, Metal, MPS. Use when Codex needs to inspect, implement, benchmark, or review Objective-C++ Metal PyTorch extensions. Okay, and this gives a detailed overview. Spoiler alert, I've actually just generated all of this

Segment 3 (10:00 - 15:00)

based on various things that I've found on the web. And I would suggest that in many ways, that's the best way to get a skill going and then to validate what you see and how relevant that is with how you think that the kernel should go. The winner of this competition will have a very strong back and forth relationship between the context and their own research and the agents that they're defining. So here we have a specific agents definition, and we can go and see all of these agents that Codex has access to. So the first one is Benchmark Methodologist, so Performance Measurement Specialist. Okay, so that's going to measure how well the kernels perform. Correctness Coverage, Read-Only Validation Specialist for Custom PyTorch MPS Operations. Okay, this one is read-only. It's got specific instructions here like stay in read-only validation mode. And we've got a series of different agents that are performing different tasks. As I said, in practice, I haven't gone deep on these and I've just taken what Codex has given me based on its default skill generation strategy. So in order to win this contest, you're going to have to go pretty far on this and define this context so that it works. Let's go back to the terminal now and see what happens when I use this skill. So okay, this is how I created the skill. So I used what in Codex is called the skill creator skill. I said, create skills for optimizing kernels for Mac Metal MPS based on that same README. Define sub-agents to optimize the work. And I gave it the documentation for sub-agents just in case. Define everything within this repo. And Codex did that and it used its own documentation to make sure that it conformed to the best format for its context. And formally, it's done a great job. Everything's completely compatible and it works. And you can see the diff here and you can go through that. So okay, that's created. Let's go to the next terminal now and take a look at how we can use this skill. Okay.
So in this example, I've now started Codex with that skill in context and those agents in context. And that setup was called optimize MPS kernels. So I say optimize the kernel for Mac Metal, MPS with Torch and matmul original. So in this case, I don't need to give any extra prompt because everything is in the context already based on the skill and the agents that I've defined. And you'll see that already it starts off in a different way. It's still worried about not being in a work tree but we get over that. And it notices that it already contains optimizations. Okay, and how does it work? So we define this in Python. Okay, that's not the best starting place but let's have a look. The file now has an explicit kernel selector and dtype selector and validating MPS behavior first. Okay. Works through this. Okay, so we are... Okay, we're essentially still just using torch.mm under a specific constraint. And what do we get? Okay, so it just prints off the performance of its optimized kernel but it doesn't compare, which is kind of bad. So let's see what the optimization is. So in fact, here we are still once again slower. So in some situations we are almost 11 times slower than the default implementation. Now, you've got to remember that the default implementation in Torch is used widely by the community so it kind of stands to reason that it's a pretty good kernel, right? Like it's not going to be terrible. And you can see that we can't implement a new one here even with our skill. So after this I then went away and added more sub-agents to the skill. Specifically I added this validation example and this research agent. So this research agent would go out to the web and find more resources beyond the ones I looked at and the validator here would go and gatekeep to make sure that every single kernel was actually benchmarked and optimized.
My thinking there was that by being explicit and telling the agent not to just use Torch integrated libraries, I could stop it from just doing that. Let's see what it does when I use that skill. So same kind of start. Optimize the kernel for Mac Metal MPS with Torch matmul original and matmul skill this time. Compare optimized and original in the table. So I basically just take the two last prompts and I put them together so that it's a bit more convenient. And it says that it's using the skill and it's going through the same starting process.

Segment 4 (15:00 - 17:00)

The current skill already chooses torch.mm. Okay. So it's already aware of Torch's integrated approach and it does a series of runs to see how it works. So, okay. In this case, this is kind of interesting. It says the first timing confirmed that matmul skill and matmul original are currently functionally equivalent. So it's generated a version this time that's in fact exactly the same as the original one just in a different implementation. And so you can see here what it uses is the functional library but it's in fact the same implementation. So, okay. We got a lot of diff. It looks like it's writing a lot of code but it's already said that it's aware of the fact that they're the same. Okay. And then it's going to give a table. It does some more changes this time. This time removing any kind of customizations to it because it has now realized that it's using the same thing and it compares the two here and let's see what it says. So it's implemented it. It says the optimized default now routes MPS matmul through Torch's tuned torch.mm backend while keeping the handwritten Metal kernel available for diagnostics. Okay. So in fact it doesn't even use the optimized kernel and just admits that the unoptimized, well, the optimized but standard kernel implementation in torch.mm is in fact the best route. So what we have here are basically a first example where it failed and then two examples where it cheated and just used the default implementation. So that really defines the nature of this hackathon. It's a question of getting an agent to perform a very difficult task that it can't perform as standard and also preventing that agent from cheating and making sure that it stays on track and actually does the task.
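One way to picture the gatekeeping idea is a small check that rejects a submission when it falls back to Torch's built-in matmul, is numerically wrong, or is not faster than the baseline. The Python sketch below is an assumption-laden illustration, not the validator agent from the video: `uses_builtin_matmul`, `gate`, the forbidden-name list, and the tolerance are all hypothetical, and it only inspects the Python wrapper source, not any Metal code behind it.

```python
import ast

# Attribute names treated as "just calling the built-in matmul" (hypothetical list).
FORBIDDEN_ATTRS = {"mm", "matmul", "bmm"}


def uses_builtin_matmul(source: str) -> bool:
    """Return True if the Python source calls .mm/.matmul/.bmm or uses the @ operator."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Attribute) and node.attr in FORBIDDEN_ATTRS:
            return True
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.MatMult):
            return True
    return False


def gate(source, max_abs_error, baseline_ms, candidate_ms, tolerance=1e-3):
    """Return a list of reasons to reject the candidate; an empty list means it passes."""
    reasons = []
    if uses_builtin_matmul(source):
        reasons.append("falls back to a built-in matmul")
    if max_abs_error > tolerance:
        reasons.append("output differs from the reference")
    if candidate_ms >= baseline_ms:
        reasons.append("not faster than the baseline")
    return reasons


cheating = "import torch\ndef run(a, b):\n    return torch.mm(a, b)\n"
print(gate(cheating, max_abs_error=0.0, baseline_ms=1.3, candidate_ms=1.3))
```

A static check like this is easy to evade, so in practice it would sit alongside the measured benchmark and a correctness comparison rather than replace them.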
So if you're looking to take on a really hard challenge with agents that is at the kind of bleeding edge of what they can do, then you should take on Humanity's Last Hackathon. As you can see from a couple of runs, it's not something that they can just YOLO. We really need to optimize and customize them for this task. It's going to be a lot of fun. We're going to share leaderboards on the Hub, and the fastest kernels will be shared and made public for the whole community to use. Thank you.
