Your RAM Has a 60 Year Old Design Flaw. I Bypassed It.

54:26

Your RAM Has a 60 Year Old Design Flaw. I Bypassed It.

LaurieWired 07.04.2026 915 569 просмотров 56 944 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

Modern DRAM is based on a brilliant design from IBM. But, we're still paying for a latency penalty that's existed since the 60s! In this video, I'm introducing my personal research project (Tailslayer) that immensely reduces p99.99 latency on traditional RAM! By implementing a hedged read strategy taking advantage of (undocumented!) channel scrambling offsets, I've gotten as much as 15x reductions in tail latency. The technique works across Intel, AMD, Graviton, DDR4, DDR5, x86, ARM, you name it. Check out the C++ lib I wrote, watch the video, and try it yourself! --- Timestamps: 00:00 Your RAM goes Blind 01:58 A 400ns tRFC Lockout 04:42 Can we predict it? 08:02 Hedged Reads 15:41 The ROB Trap 19:20 Multicore Threading FTW! 23:19 Memory Controllers Hate You 27:09 A Dark Reason (Rowhammer) 30:07 Reverse Engineering Channel Scrambling 33:44 Where is my Data? (Physically) 38:46 Does it actually work? 45:20 ARM Graviton...the Black Box 48:06 High Frequency Trading --- My project / code (Tailslayer) on GitHub: https://github.com/LaurieWired/tailslayer --- Check out my X account for cool Computer Science stuff! https://x.com/lauriewired

Методичка по этому видео

Структурированный конспект

Оптимизация задержек памяти: как выжать максимум из RAM с помощью метода Tailslayer

Изучение архитектуры DRAM, механизмов обновления и борьбы с tail-latency для разработчиков низкоуровневого ПО за 55 минут.

Оглавление (13 сегментов)

Your RAM goes Blind

Every 3. 9 microseconds, your RAM goes blind. Your RAM physically has to shut down to recharge. You can't stop it. You can't predict it. You can't outrun it. In exactly 3. 9 microseconds, we're going to but you can outsmart it. Croton-on-Hudson, a small town overlooking the Hudson River, about an hour north of Manhattan. Bob Dennard arrives home after a long day at IBM. It's already dark out. He collapses onto the chair. His team is losing. All day he watched a rival team at IBM demonstrate their memory design. It probably wouldn't have worked, but it looked good. Really good. Meanwhile, his team's design would need six transistors just to store a single bit. There had to be a better way. But what if we only had one? He calls his boss at 10:00 p. m. His boss tells him to calm down, take two aspirin, and they can talk about it in the morning. In that moment, right there on the couch, Dennard had just invented the memory system that would end up inside every computer, every phone, and every server for the next 60 years. There's just one dastardly catch, and it's the same catch that your computer currently has. In order to hold that single bit on just a single transistor, Dennard had to store it as a tiny electrical charge on a capacitor. The problem is, it leaks constantly, kind of like a bucket with a hole in it. If you don't do anything about it, all of your data disappears. — So, every couple microseconds, the memory controller stops what it's doing and recharges every cell, filling up all of the buckets. And while it's doing that, you just have to wait.

A 400ns tRFC Lockout

— This lockout is defined by the JEDEC spec as tRFC or refresh cycle time. Now, a regular read on DDR5 might take you like 80 nanoseconds, but if you happen to accidentally get caught by this lockout, that's going to bump you up to about 400 nanoseconds. And that might not sound like a lot, but on modern CPUs, that means you have to burn like potentially 2,000 cycles just sitting there waiting for the drawbridge to go back down. The majority of the time, you don't notice. It's just the physics trade-off that Dennard made, you know, 60 years ago. You're playing a game, you're browsing the web, whatever. You don't even realize that it's happened because it's such an insanely small amount of time. To put in perspective, the human eye takes like 300 milliseconds to blink. In the time it takes you to blink once, your stick of RAM froze and unfroze about 150,000 times. Hold on a second. That's a really bad excuse, and technology never got anywhere by saying, "I accept this, and it is what it is. " Cuz I can think of at least one field that might really care. Stop. Think for a second. What industry might really care about wasting a couple hundred nanoseconds, where one incorrectly timed stall would cost you millions of dollars? That's right, the world of high-frequency trading. We'll get back to that a little bit later. Take a look at this. I slapped together a nice little benchmark program that's going to make the RAM stalls on DDR4 like really obvious, and it's literally just a flush read timestamp on a tight little loop that targets one particular address, and then I hammer that address millions of times, or like 20 million times to be exact. So, on x86, this is going to be CLFLUSH, MFENCE, RDTSC over and over again. Spin it up on a bare-metal cloud instance that uses DDR4, and it should only cost like a couple cents. Also, theoretically, our benchmark code should work on basically any DDR4 system. Now, I'm not going to get into all the math here, but because our cloud AMD processor runs at 2. 65 GHz, that means that we should encounter a spike if we've hit that RAM hiccup about every 20,720 cycles. Side note, measuring time on this level is actually very interesting. Did you know that computers keep track of their own cycles? So, with this level of accuracy, we can measure things on the sub-nanosecond level. And look at that. When you plot the gaps between the slow reads, they're all the same, 7. 82 microseconds apart every single time. And each one of those is one of the refresh stalls happening that we can physically see happening. Those are literally the leaky capacitors recharging. But the question is, if we

Can we predict it?

can see it, can we predict it? Now, I've got a little idea. Now, for the majority of programs, you don't care about these kinds of periodic spikes over here. But if you're working with super low-latency applications, like with high-frequency trading, for example, you don't just care about average latency, you also care about determinism. So, wouldn't it be nice to be able to reduce what we would call these P99 or P99. 99 spikes? Now, it doesn't happen often, but if one of our reads lines up perfectly with one of the refresh cycles, that's going to suddenly add hundreds of nanoseconds in latency to our operation. So, the question is, if this is so periodic, can we potentially predict when the refresh cycle's going to happen, and then try to read around it? You would think that if we put a read here, we could be like, "Okay, the next one's going to be the refresh cycle, so we're not going to read now for a little bit. " Or even conversely, if we put it here, we could be like, "Okay, yeah, we've got like 6 microseconds of room left. We should be good. " It gets a lot more complicated than that. I had this whole theory going where you would just pick different regions in memory that had refresh cycles that were uncorrelated to each other. And then when this one was about to experience a stall, you would just go over and read from the other one and be good to go. It's like dodgeball, where we dodge the spikes that are about to hit us. See, it's not like the whole stick of RAM gets locked when the refresh cycle happens. It's a lot more granular than that. With DDR4, for example, the refresh happens at the rank level. And then DDR5 gets even more complicated, where you can like subsection down even further than that. So, every night I was going down a rabbit trail trying to code a strategy that would work around this. None of that matters for the video. Did it work? This doesn't work at all. This is a really important lesson in statistics. Being able to see a rhythm is not the same thing as having a really reliable countdown clock. We can see the beat after the fact, but what we were measuring previously was the downstream effect. We didn't really have to worry about predicting caches, queues, contention, OS scheduling, and all this kind of stuff that makes our spike a little bit fuzzy. There's also a problem in the JEDEC spec that absolutely kills our ability to predict this. Pulling in and postponing refreshes is explicitly allowed. The memory controller does what's called opportunistic refresh scheduling, which basically means that it can postpone up to eight refreshes and then catch up later if we happen to be in a busy period. So, most of our spikes happen every 7. 8 microseconds, but it's not like an exact metronome, it's more like an average budget. So, even if you get just a tiny bit off, you progressively get worse and worse. And considering that this happens like 100,000 times a second, you progressively get just incredibly off. Basically, for a very, very short window, my strategy would work, but then in a blink of an eye, it would get incredibly off, because how the heck are you going to predict opportunistic refresh scheduling? So, I'm not going to lie, I was really down in the dumps for a while because I didn't think it was going to work, and I almost scrapped this entire video. But mama didn't raise no quitter. I knew I just needed to think a little bit harder. Back to the drawing board.

Hedged Reads

Seattle, a small town overlooking Puget Sound, about 44 hours west of Manhattan. Laurie Wired arrives home after a long day at Google. It's already dark out. She collapses onto the chair. "I can't do this anymore," she thinks. "I've got to read something else. " Changing her routine, she grabs an old ACM issue. It looked good. Really good. Of course, that was in the web server world. Meanwhile, her design in the land of low-level memory was messy, complicated. There had to be a better way. Wait. What if I did the same thing? She calls her mom at 10:00 p. m. Mom, I've come up with a new strategy for RAM reads. Okay. Whatever that means. What do you think about that? I thought that if you thought of it, it's a good plan. — All right, legitimately, all jokes aside, there's something really interesting here. If we take a little bit of a closer look at Google's The Tail at Scale paper, which talks about reductions in tail latency, this was published back in February of 2013 in the ACM Communications magazine, and they use some really interesting techniques, such as hedged requests. Now, I highly encourage you to go back and read the actual paper itself, but the primary thing that you need to take from this is that web users really care about reduction in page latency. Instead of doing a ton of advanced work to try to speed up one single machine, what they do is just duplicate it, send the same web request to multiple machines, and then pick whichever one is the winner. And then they'll just drop all of the loser requests that didn't make it fast enough. This is honestly like a super intuitive, kind of obvious concept that's cool, but unfortunately, it's such a different domain than what we're working with here because they're working like super high level with thousands of different machines, and then they're working with millisecond latencies. We're working on a single box with the lowest level of abstractions trying to reduce latency in the nanosecond range, so orders of magnitude smaller. But, I've got a hunch that it could work. It's always going to be way more fun to explain things with toys. Now, imagine that we have a simple move instruction in assembly, and this entire thing represents our move instruction. It's very accurate. So, you see we have our train path over here, and then our drawbridge is going to represent our TRF stall. So, you can see. Here we go. That's our stall. And then our train is going to represent a memory request being sent from the CPU. Doesn't this just look like a Triceratops? First, it crosses the data fabric portion of our trip. And that's going to last about 15 to 20 nanoseconds before it hits the memory controller station right here. And that can be like 5 to 10 nanoseconds, maybe a little bit longer. And then we switch over to the DRAM portion of our trip, and that's going to be where our drawbridge lives. Oh, no. Here we go. Come on. A little bit of help. If our RAM isn't busy, then we can just, you know, pass on right through and continue on with our request. But, if it is busy and it's in the process of performing a request, then our drawbridge is up and our train is stuck there, and it just has to sit there and wait for the refresh to finish. And this is an excruciatingly long amount of time because we can wait for like 400 or even 500 nanoseconds. We are stalled, everyone. I wonder what would win, train or drawbridge? Oh, apparently train wins. Look at that smoke on that. That is just like the coolest thing I've ever seen in my life. Trains. Once the drawbridge opens back up, the controller can start servicing the request. Now, if the right row isn't already open, the controller issues an activate command. In the JEDEC spec, after activate, you wait tRCD before you can issue the actual read, and that's going to be like 15 or so nanoseconds. Activate. — Good timing. Then we can actually read the memory, and this is where our CAS latency comes in. Column access strobe. I know CAS here does not stand for compare and swap if anybody's into lock-free or wait-free programming. Go take a look at a stick of RAM, and you'll see it has the CL number on here if you can see it. And what this number is this is actually the CAS latency measured in clock cycles. So, this particular one has a CL number of 30, and what that really means is it takes exactly 10 nanoseconds in wall time. You'll notice that more expensive RAM tends to have a lower CL number, and in some industries, they'll pay like a super heavy premium in order to minimize this. And this is one of the only pieces of the track that you can actually reduce. After that, we take a few nanoseconds to go over the DRAM bus, and then finally we go across the fabric to the CPU, which is going to be another 15 to 20 nanoseconds. We made it. I'm going to smoke right here cuz then he's going to go like super fast down, and then he's going to just like break through the smoke. Oh my gosh. — [snorts] — That's just sick. Peak content. So, what can we do? Like I said, this is the only part of the track that we can replace to make a little bit shorter, but even if we do that, if the drawbridge is up, you're still kind of screwed over here. But, what if we built a second track and then had two trains? And voila, same starting point, but we have a completely different channel that we can go through. So, basically what we did is we just have an identical copy of our memory in a different location. That means we can have two different drawbridges as well. And if we're kind of clever about it, we can pick a second drawbridge that has a completely independent or uncorrelated schedule to the other drawbridge. Now, what we're going to do is we're going to send two trains at the same time. So, that means that even if this drawbridge is up, this one is probably down. So, whichever train reaches the end first wins. Okay, let's do our data race. Here we go. And we got this drawbridge. It's going to be the one that's up. Here, a little bit of help. Yep, there we go. Oh, no, this one's stuck. Here we go. We're still reading our data over the CAS. Woo. Come on. All right, he won. And now this one finally goes down and can finish, but he's like super late to the party, so who even cares? He also sucks at his job. For the other one, we just throw away the cargo because we already got what we wanted from this guy over here, so who cares? Now, all this sounds really simple. It's just hedged reads, but shrunk down to the nanosecond scale at the lowest levels of hardware. There's just one tiny problem.

The ROB Trap

So, here's the thing about modern CPUs. They have something called out-of-order execution. Now, imagine we have two different move instructions in assembly, and one of them happens to be reading something from channel A, and the other one B. Well, these requests can actually fire at the same time. What happens is the CPU's out-of-order engine analyzes these instructions and then realizes they're independent from each other. They have different source addresses, and they have different destination addresses. So, it sends both requests off through the cache hierarchy. They both miss in L3 cache and then get sent off to different memory controllers. And boom, both trains are off moving in parallel. They might be like one cycle apart, but it doesn't really matter. I think this is more than one cycle apart. It's perfect, everyone. It's perfect. Completely synchronized. I used to have a recorder growing up, and let me tell you, those are great for kids, but they have to drive everybody else around them absolutely nuts because I remember I would just like have all my fingers down and like have one up, and I just go pff. — [gasps] — You may be saying, well, what's the problem? It seems like the concept should be working just fine, and they're running in parallel to each other perfectly. Well, the problem is for our specific scenario, we're going to encounter a tricky little quirk I think he's racing other data. — [gasps] — We're going to encounter a tricky little quirk known as a ROB stall or head-of-line blocking. The problem isn't like the dispatch portion of it. We blow the whistle, and they both start at the same time, and they're running in parallel. That means if one drawbridge happens to be up, then the other one can theoretically win. Here's the frustrating part. We've got this guy over here stuck by the drawbridge, so this guy's going through. Looks like he's going to win. Doesn't matter if this guy's stuck because this guy can keep on going. But then, he stops right there. And he sees it. The CPU register is right in front of him, but there's just this invisible wall stopping him from getting there, leaving enough time for this guy to mosey on. He takes his time. To then win. So, what just happened? Well, remember, it's not the dispatch portion of this whole thing. It's the retirement. You have to commit instructions strictly in order every single time. And why is that? Well, modern CPUs are constantly speculatively executing hundreds of instructions ahead assuming everything goes right. But, what if something goes wrong? Your CPU has to kind of hit the undo button. But, in order to be able to do that cleanly, it has to know exactly what instructions were executed, in what order they were executed, and up to a specific point. And that is the reorder buffer. It's kind of like this sequential checkpoint system. So, the CPU can't finish processing the fast request until the slow one has also completed. The fast read came back with our data really quickly, and the CPU might even start doing some math with it speculatively, but it can't commit any of those changes because it had to wait on the stupid slow read that got stuck by the drawbridge over here. Eventually, the buffer just fills up completely, and the core just grinds to a halt. Sad. So, no, you can't hedge on a single core. Unless

Multicore Threading FTW!

Well, guess what, buddy? It ain't 2004 no more. We have multi-core processors. But, you might be saying, oh, well, why are you so excited? Well, because it is exciting because separate cores have separate reorder buffers. That's right, everyone. If we fire off these two different reads on two different threads on two separate cores, they can both run, hit a tRFC stall, whatever, doesn't matter, but they're not going to get caught by any arbitrary invisible walls at the very end. So, we solved it, and it probably sounds pretty obviously, honestly, right? So, now all we need to do is write a little program that's going to take our data, replicate it on two separate memory controllers, so that way we don't have the drawbridges hit at exactly the same time. So, piece of cake, right? — [sighs] — So, apparently the operating system doesn't really like it when you can find out physical memory addresses. Something about security. I'm just being dramatic here. Every good programmer knows this already, but it makes it more fun. See, I'm going to go over here, write a little program, allocate some memory, and we'll get an address back. It looks like a real address, it feels but it's not a real address. It's fake. The OS made up this address, and this is called a virtual address. See, every single on your computer thinks it's living in this kind of happy little dandy world where it has the entire machine dedicated to just it and its execution. And it kind of makes sense because if programs could mess with the physical memory of other programs on the machine, it could cause like a ton of different problems. The real address, the physical one that actually corresponds to a location on your RAM stick, is hidden from you. So, behind the scenes, the operating system is performing this entire remapping process, and it really doesn't want you looking at it. It's shy, everyone, okay? Anyway, this wouldn't be a problem. Who cares? Unless you're trying to do something really specific, like, I don't know, trying to make sure your data appears on two separate memory controllers. Let's say I want to run a little experiment using offsets, and I take two copies of my data, and I space them nicely 128 bytes apart, and I'm feeling pretty good about myself. But for all I know, it could be straddling a page boundary, and then the OS could have decided to put them wherever it felt like putting them, and with that, my entire hedging strategy is completely ruined. — It's okay, Lori. It's okay. — [snorts] — Now, luckily for us, there's a cheat code, and this cheat code is called huge pages. So, normally, if you're working with Linux, for example, the OS will give you 4 kilobyte chunks kind of scattered all over the place. So, you might think you have this 1 megabyte of contiguous space altogether for yourself, but in reality, it's just 256 scraps scattered all over the place. I'm Linux, and this is what I do to your RAM. As soon as you cross a page boundary, everything kind of gets messed up, and you could get theoretically transferred basically anywhere. But if you ask nicely, and by nicely, I just mean you pass an additional flag over to a system call, you can get a huge page. And a huge page is one big contiguous block of memory. And more importantly to point out, it is one physically contiguous block of memory. Now, virtual address plus 128 also equals physical address plus 128, as long as you're working inside of the huge page that we've allocated. And we've allocated so much that it's actually large enough to be able to play around and experiment inside of. So, yeah, no more vibes, just math. Everything's going according to plan.

Memory Controllers Hate You

— Unfortunately, there's yet another layer of abstraction that we have to deal with to make this work. Most computers these days use a 64-bit addressing mechanism, but in practice, if you're looking at x86-64 systems, in reality, they only really use 48 physical bits to represent the memory addressing portion of it, or 52 bits if they're like a really large system. But even just using those 48 bits, that still gives you the ability to address 256 terabytes of memory addresses, or even 4 petabytes of memory addresses if you're going for the 52-bit option. For our purposes, that means that somewhere inside of those 48 bits, some combination of values is going to determine what memory controller that your data is actually going to end up on. I don't think that's a valid memory address. Okay, that was my spilled water. Remember how we're trying to get copies of data for our hedged reads to end up on basically two different drawbridges? Well, that kind of implies that we need to reverse engineer how the memory controller really maps these addresses over to the real physical locations. And you might be thinking this is probably like some predictable linear type thing, like maybe uh the lower bits correspond to the row, the middle bank, the high bits correspond to the column. It's really not like that, like at all. Every modern CPU, whether that's Intel, AMD, ARM, doesn't matter what it is, takes your address and runs it through a set of undocumented XOR hash functions, which scrambles up where your data actually goes. So, even if you bypass all of those OS-level primitives that are actively working against you on this, and you do end up with two addresses that are physically right next to each other, let's say like uh address hex 1000 and address hex 1040, which are only like 64 bytes apart from each other, they could still end up on completely separate RAM sticks. What's even worse is even if you did take two physical memory addresses that were really far apart from each other, let's say we've got address hex 1000, and then 8001000, which is 128 megabytes apart, they still could end up on the same stick, channel, and bank. So, why the heck is that? Well, with everything we've been doing, there's been a good reason for it, and same thing goes for this one. It just makes our specific use case really difficult. First reason why all memory controllers do this kind of scrambling is performance. So, here the hardware engineers are kind of compensating for potentially lazy programmers. Now, picture this. Let's say you're stepping through a really big array, and all of the elements inside of the array are evenly spaced, and you're performing all of your mapping in huge pages. If the memory controller didn't do this kind of scrambling, then common access patterns like stepping through an array sequentially would continuously hammer one bank of RAM over and over again, while all the other banks and subchannels just sit there completely idle. You have to remember that these might look like just two sticks of RAM, but they're actually broken up into four independent subchannels, each with 32 banks. So, the XOR hashing phase kind of acts like a load balancer baked like directly into the silicon itself. Takes in your physical address, does a little bit of scrambling, and tries to spread it out evenly across all of the banks and channels. I'm your friendly neighborhood XOR hashing load balancer. That's for DDR5, by the way. Even if your lazy programmer wrote a memory access pattern that's super linear, the hardware makes up for it on the other side, adding all of this kind of randomization. There's also a much darker reason.

A Dark Reason (Rowhammer)

In 2014, at Carnegie Mellon University, a PhD student named Yun Gyu Kim was studying DRAM scaling issues. Their group was specifically studying DRAM reliability, — and they wanted to know how memory would behave as modern-day node sizes progressively shrunk and got more dense. Now, what they did is they created this neat little test bench memory controller that would allow them to issue raw DRAM commands, so they could specifically target individual rows and watch what happened. Turns out, if you open and close the same memory row fast enough, you can flip bits in the row next to it. Basically, aggressive activity in one place physically disturbs the neighbors. They published their results in their 2014 paper called "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. " That's a mouthful. They just [snorts] called this technique row hammer, though, and the danger was clear from it, but they deliberately stopped short. In section four, they state, "With some engineering effort, we believe we can develop code 1A into a disturbance attack that injects errors into other programs, crashes the system, or perhaps even hijacks control of the system. " A few months later, a few researchers on Google's Project Zero team took this technique and weaponized it against the operating system. So, you could take like a completely unprivileged process, no root, no admin, no nothing, just a normal process, and completely take over the machine just by reading their own memory in the right pattern. The scariest part is that memory manufacturers probably already knew about the problem because in June of 2012, before Kim even published his paper, Intel filed a patent called row hammer refresh command. That means that they were already filing a patent for the fix before anybody even knew there was a problem. Spooky, right? Here's why I told you all of that. So, DRAM hashing strategies were already not documented publicly, but then after the entire row hammer thing, obviously, there was even less incentive to publish these load-balancing math strategies publicly. To pull off the classic double-sided row hammer technique, you have to have and hammer two physical memory addresses on two different rows and sandwich the target in between those two rows. But in order to make that happen, you have to understand how the memory controller works and how the memory addressing works. Kind of like a security feature. It's not a bug, it's a feature. And my project is just kind of like an unintended victim. Now, even though I'm not trying to do any kind of malicious bit flipping, it really doesn't matter. Intel and AMD are not going to make it easy on us regardless of the intent behind the project. So, I did what any reasonable person would do. Buckle down, pull out a little bit of math. We're going to reverse engineer the undocumented channel scrambling for Intel, AMD, and Graviton.

Reverse Engineering Channel Scrambling

Good news first. So, for my hedged read theory, I don't need to have the same level of granularity that you would need to have for performing an entire row hammer attack. All I need to do is make sure that I have multiple copies of data that end up being relatively uncorrelated locations in memory. Honestly, even just like two separate channels would be totally fine. That way, I can make sure that I don't encounter a stall at the same time when I'm performing an access. And unfortunately, there's a lot of different bundles of reasons why CPU manufacturers make this specific idea difficult. What we're doing is not exactly in the user manual. Another reason is competition, because knowing the hash function kind of reveals some of the different architecture and design trade-offs that you made, and you don't want to just like openly publish that for your rivals to be able to see. You might have different design decisions and philosophies for how your memory controller handles locality and distribution, for example. So, if you understand those hash functions, you also kind of understand the core architecture trade-offs. I'll show you some of the results shortly, but one tidbit that I did figure out is that AWS's custom Graviton arm chip uses exactly the same channel addressing as AMD's epic, which is x86. So, the bit positions are completely identical. And even more interesting than that is Intel uses a completely different strategy. So, there's not like some physics reason that's saying why everybody needs to use an identical strategy. It just goes to show that behind all these closed doors, these architectural decisions are actively being made by CPU designers. But you have to do what I did to be able to see them. But for all of this as well, it's also about performance flexibility. This address hashing can change per skew, per CPU stepping, per BIOS setting, per microcode update. If AMD and Intel documented this kind of stuff, they'd kind of be like locking themselves into a strategy, because customers would start to build against it. And then next year when it comes around, it's really going to make your life difficult, because you're not going to be able to change things nearly as easily. But if you just don't document it, well, who's going to complain? Only weirdos doing crazy stuff like me. There's also the security angle. It's kind of like security by obscurity, and I am the obfuscation person, after all. But pretty much, if you keep all of this undocumented, it's going to raise the bar significantly for being able to perform these row hammer style attacks, even if there are also other mitigations as well. But basically, from the CPU manufacturer perspective, there's no legitimate use case. Nobody at Intel is thinking, "Hey, some random girl in Seattle wants to replicate data on two different channels and race them. " The vendors all assume that their address hashing is transparent to developers. And this is true for throughput, but I care about gosh darn tail latency. It's okay, cuz we can still do this. I got this. — I started with AMD first, mostly because my editing PC is Zen 4, and it's kind of nice to be able to really hammer at a machine without that voice in the back of your head saying, "Oh, you're wasting cloud dollars. " Here's the thing that most people don't know. Inside of your CPU, right next to the memory controllers, there's actually tiny little hardware counters, one for every channel. They tick up with every read that goes through that channel. Now, AMD mostly put them there so performance engineers could monitor bandwidth. That way, you can run your simulation and be like, "Oh, there's too much traffic going through on channel three, and it's

Where is my Data? (Physically)

overloaded. I guess my traffic's not balanced. " And no CPU manufacturer really cares if you get this kind of information, so they don't mind exposing it, cuz it's kind of mundane. If we do a simple pseudo modprobe AMD uncore, it reveals those hardware counters to the standard Linux perf tool. Now, if you've never used perf before, it's a pretty cool tool. We mostly use it for measuring cache misses, branch mispredictions, CPU stuff kind of like that. But if you load this module that exposes those extra hardware counters to you, which most people don't even know exist. Now, you're one of the special few. Here's the thing. If I write a tight loop of code that constantly flushes the cache and hammers one particular memory address, then that means one counter should start to light up. And theoretically, this should tell us exactly what channel that our data is living on. And the code for this is kind of deceptively simple, but you really have to understand all of the previous architecture knowledge to get to this point. The flow is basically allocate a huge page, cuz remember that gives us a ton of physical real offsets that we're able to work with. And then we pick an address somewhere inside of that huge page that we've allocated. And then we have our nice little x86-64 inline assembly loop. Inside of that assembly loop, first we need to flush the cache and evict that. So, that's our nice little CLFLUSH instruction. Otherwise, the CPU might cheat and try to read the value from L1 cache. And in this particular case, we kind of want the inefficiency. Then we wait for the flush to finish. We force the read, and then we're complete. Moment of truth. Let's go. That's kind of weird. Can't really tell what's going on here. Well, that, my friend, is OS noise. Even though we have a really minimal Ubuntu install, there's still like a ton of processes happening in the background performing reads that could be interrupting us and causing additional noise. The problem is these counters are pretty dumb, so you can't tell it only count the reads from this particular process. So, what does that mean? Is it still over for us? Nope, not at all, and the fix is actually really easy. All we need to do is run it 50,000 times. And that might sound like kind of a lot, but it's really not, because CPUs are so fast, this only takes like a few milliseconds of wall time. In fact, it's so fast that I can't even show you a loading bar. Wow, see that spike? Super cool. And now I really know where my data lives. But here's where the fun part really begins, because let's flip one bit and do it all over again. Hold on a second. What about all the XOR hashing stuff? Isn't it more complicated than that? Well, technically, yes and no. For row hammer, you need to know the entire hash function, every XOR, every bit. Basically, the entire mapping from address to row to bank, since you need to know exactly where you're going to land. I do not need this information, since I'm doing a performance engineering project. So, to me, I don't really care which channel I'm ending up on, whether that's channel three, channel seven, whatever. Doesn't matter to me. All I need to do is make sure I'm ending up on different channels. So, that means I'll end up with different drawbridges, and so their stalls are going to happen at different times, and my hedged read should work. The mathematical answer is that XOR is linear over GF2, which is actually really simple. Basically, that means that no matter what scrambling the memory controller does, flipping a base bit will always flip the output, no matter how many things are chained together. And that means that we don't have to reverse engineer the entire black box. We just have to be able to observe the input and output behavior. This here represents my AMD Zen 4 memory controller. These represent my physical memory channels on my AMD system. And I'm on a consumer box, so there's only two of them. And then these are going to be the bit positions six, seven, and eight. And I'm going to do this in demonstration form and in the code, so it makes complete sense. All this is running on my Zen 4 system. If you know binary, I want to flip bit six. So, I'm going to offset my data 64 bytes. Same channel. Test our code again. Yep, same channel. Bit seven, 128 bytes. Test our code. Yep, same channel. Bit eight, 256 bytes. Different channel. So, wow. 256 bytes apart, and we end up on a completely different memory controller. Now, if we do this for every single bit, we end up this building up this kind of map about how the memory controller routes your data. Some of the bits are going to matter, but most won't. But essentially, extend this out for 24 more bits. Anyway, 256 bytes apart is all it takes for this particular system, which doesn't really sound like that far apart. But that means they're going to end up on completely different channels and controllers. And most importantly, they're going to have independent refresh cycles, aka different drawbridges. So, that's AMD. And once the code is written, it takes like 30 seconds to map out a system. So, really, the hard part here is knowing what you're looking for.

Does it actually work?

After all that, the virtual memory, the huge pages, the bit mapping, the XOR hashing, etc., you know what my hedged read strategy looks like? It's just two addresses 256 bytes apart, pinned to two separate threads, and then you just take whoever wins first. I think there's a song for this. It's like, "It's so simple that it's complicated. " So, let's see what happens. Side note, I cannot stand Lil Wayne. Absolutely not. But you know who's really good is Childish Gambino. Let's take a quick look at our benchmark code. Now, the majority of this code is just extraneous. It's related to — setting up the benchmarking portion of it and then performing the measurements after the fact. But if you want to look at the core portion of this, it's surprisingly small. Just take a look at the measurement thread function inside of our benchmark class. And this is performing the actual read portion, as well as the measurement of the amount of cycles that something took. If we dig further into the benchmark portion of this, we can see we have this kind of what I like to call a benchmark sandwich. So, we start out with our little CL flush instruction, cache line flush, evicting that from all levels of the cache so we can make sure we're reading directly from RAM and not messing up our benchmark times for that. Then after that, we have our barrier instructions, these fence instructions over there, to make sure that we're not executing instructions out of order, so we do actually get like a correct benchmark time. And then of course, we have our time reader, we have our RDTSC, read timestamp counter. And then at the very end of that, we're going to be subtracting these times from each other so we can get the total number of cycles that it took to perform the operation. And this is going to be our wall clock time, so it's going to include the stall if we happen to get that during our actual RAM read. And then at the very heart of the code in the center, sandwiched by all of this, we have our actual read of the address. So, this is dereferencing the pointer and storing this inside of our valve variable. All right, 20 million samples on my 7950X3D dual channel DDR5 Zen 4 Ubuntu 2404 machine. Let's go. This is what's called a CCDF chart. It's actually pretty simple to read. Just pick any point on the line, and the Y axis represents what fraction of reads were slower than that. And then each tick on the Y axis is another nine. So, you have your P99, P99. 9, P99. 99. So, basically, each tick down is 10 times rarer. The red line we have plotted right now is the latency distribution of a traditional read. Basically, further right and up is worse, and further left and down is better. Now, let's plot my hedged read in blue. Wow, it really works. So, if you take a look at our P99. 99 over here, so that's our one in every 10,000 reads, you can see the regular read strategy took 631 nanoseconds, but my hedged read strategy only took 281 nanoseconds. So, it really works. If you go down to the really rare stuff, you'll see that our regular read is getting reads up to a microsecond long, which is way worse than our hedged read strategy. Now, whether that's because of controller contention, a stall cycle, or a bank conflict, or whatever, doesn't matter because that's happening on one channel. If we go to our hedged read strategy, we still have another whole channel, so we don't care. This is an insane difference in tail latency. So, we've really proven that it works. On my system, at least. But, does it work everywhere else? Time to rent some hardware. Hey, guess what? More channels actually equals more hedging options. So, on my wimpy consumer board, I only have two channels. But, most servers have way more channels than that. So, as long as we figure out where each channel is, we can expand our strategy to have more copies of our data. So, it should scale really well. Because the AMD uncore performance counters are pretty stable across AMD generations, I really didn't have to change my channel discovery code that much to be able to support AMD epic systems. Really just comes down to more combinations of bits and offsets. As of today, AMD Turin, which is Zen 5, is the latest available AMD server architecture. So, we'll just go rent a spot instance real quick. Let's bare metal. This one is — C8A metal 24xL. And it's only like a few dollars. It just blows my mind. Running it again. And here we go. Wow, so our blue line is almost vertical. That means our tail latency, our worst latency here, P99. 99 and further, is almost the same speed as a regular DRAM read. So, that's like absolutely crazy. That means our tail latency at P99. 99 is literally nine times better than our regular read. So, our strategy is scaling really well. Keep in mind, that's what the stress threads hammering the system. But, the crazy part is that even with a completely quiet system where nothing else is happening, we still get a five times speed up, which is crazy. Well, you have to think about it that the refresh stall doesn't just go away just because the system's quiet. No matter how you slice it, it works. We're actually avoiding the refresh stalls in software. So, you might be saying, "Well, that's neat and all, but what about Intel? " Well, fortunately for us, Intel has the same kinds of per channel counters. They're not called the same thing, but they work in basically the same way. So, all we have to do is build the tool, scan the bits, make the map, and then once we have the offsets, we can run our benchmark code on a lot of different cloud instances. And watch this. Two-way hedge, four-way hedge, and eight-way hedge. It's the perfect result. It's basically like a straight vertical line. So, for our P99. 99, it's only 113 nanoseconds. That means it's 15 times lower latency at the P99. 99 than our traditional read. This was the Sapphire Rapids C7I metal 24xL, but it also worked on the latest Intel generations I could find, too. Granite Rapids C8I metal 48xL. — AWS runs their Granite Rapids instance in SNC3 mode, which limits how many channels we can use for our hedged read, bringing us down from 12 to four. If I could get into the BIOS to disable sub-numa clustering, then it could be even more efficient. But, it's still pretty darn good with a 7x improvement.

ARM Graviton...the Black Box

Now, here's where it gets really interesting. Previously, when we've been working with Intel and AMD, they've both had those nice little hardware performance counters available for us to use to be able to map our channels. But, what if we didn't have those performance counters available to us? AWS's custom Graviton ARM chip is super cool, but it's like essentially a black box. They don't have any custom modules you can load, performance counters, and no documentation. How do you reverse engineer something when nothing is exposed to you? Well, I might not have my performance counters available, but I do have something arguably better. The hedged read itself. So, if you think about it for a second, I don't know which bits correspond to what channel I'm on, but if I keep on executing my hedged read and measuring the tail latency, and then iteratively flipping bit by bit, I can still build out my map. If my two copies of data end up on the same channel, then we should see no improvement at all inside of our hedged read strategy. But, if they do manage to end up on two separate channels, then we should be able to see a large improvement in our tail latency because, again, independent refresh cycles. So, we sweep every bit. Bit six. Flat. Bit seven. Flat. Bit eight. There we go. And there it is. And the tail latency improvement is my signal. No documentation, no performance counters, just statistics and timing. It's kind of funny. You use the thing I was trying to build as the tool to figure out how to build it. Inception. Escher would approve. I am a strange loop. Now, I had to do a bit of extra statistical analysis under the hood, a pairwise independence test to make sure what we were seeing was actually different channels and not just noise that we were looking at. But, you're welcome to check all of that out on GitHub, same goes for all the rest of the code for this video, which I'll link to in the description of this video. But, the point is, we were able to figure all of this out with just math. Hey, remember when I mentioned you'd learn some interesting tidbits about CPU architecture from this research? Take a look at the bits that I figured out in Graviton 4 in AMD's Turin. Interesting, right? Make of that what you will. Now, we take a look at our Graviton CCDF, and yep, we get almost a perfect vertical line again, which is a 9x improvement for our P99. 99. And at this point, I'm pretty confident to say that this technique works on basically everything. So, I went ahead and tested this on about as many machines as I could get my hands on. And long story short, a few bare metal cloud instances later, we have Intel, AMD, DDR4, DDR5, x86, and even ARM. That tail latency's gone. Once again, research for each platform is all available on GitHub if you'd like to check it out.

High Frequency Trading

Well, I think that this technique deserves a name. So, I christen it Tailslayer. And you might be thinking, "Well, graphs are well and good and everything, but how could you potentially use Tailslayer in real life? " Well, remember that HFT scenario that I mentioned previously? Imagine this. You're running an options trading strategy. Maybe you have a bunch of CPU cores or even GPUs that are working on some ML algorithm that's pre-computing some massive pricing table, and like really big, maybe tens if not hundreds of gigabytes for your working set, aka way too big to fit in L3 cache or in the SRAM of an FPGA. A packet of market data comes in, and you're going to have to read one specific value from your giant pre-computed table that's sitting there inside of DDR5. But, the problem is you basically have no way of being able to predict what entry you're going to need until that market data comes in. So, the problem is this is basically going to be guaranteed to be a cache miss, and you're going to have to read from RAM. Here's the setup. We're using a specialized NIC, something like this SolarFlare X2522 onload, where you can program multiple receive queues. You've got one packet of multicast data coming in, and that gets sent to two cores. But, you have strategically used the offset mechanism, so you have two copies of your massive data, but they're at different offsets, so they're going to end up on different channels. Now, what happens is the packet comes in, each core independently parses that packet, sends off the read on the two different independent channels, and then computes the order. If core zero ends up hitting a DRAM refresh cycle, aka getting caught by the drawbridge, well, core one doesn't care because it's on a completely different channel, and it's completely independent as well. So, if core one ends up finishing first, it just goes ahead and fires off the order. On some specialized NICs with programmable FPGAs like this Xilinx X25, you could also program some logic into the NIC itself to be able to deduplicate packets. That way, when core zero finally finishes processing its request after getting caught by the drawbridge for a while, like 400 nanoseconds later, when it finally does send off its order, you know that your custom logic can just go ahead and drop it. You've basically programmed it to drop all loser packets. So, that way you don't have to worry about any kind of like double send problems. You also don't have to worry about any kind of synchronization between the cores because the point is they have to be independent. As soon as you start trying to add in atomics and stuff, it just completely kills the point of tail slayer, and I'll get into that in like just a second. Pros and cons of this. So, if you're using this kind of methodology and you're using tail slayer and stuff, this does require you to burn an entire extra core for processing your RAM reads. But, if you're running on like a gigantic server that has like 96 cores, eh? The point is you've really reduced your tail latency here, and it's very consistent. So, you don't need to worry that one random DRAM refresh cycle is going to completely mess you up. On these hot pads, a 400 nanosecond stall is a very large amount of your processing time, and tail slayer can theoretically kind of help mitigate that. Now, what you're trading is a very large amount of data that you might be duplicating. Like, if you're having two copies, you're doubling your RAM usage, and you might even be using more depending on how many copies you're making, but you're trading that for near determinism in latency. I've done a lot of research into this topic, and no public documentation exists for it, so I genuinely do think that tail slayer is a new methodology. Or, if by any chance some company happens to be using it, they're not documenting it publicly, and they're kind of keeping it a secret. In which case, sorry for revealing it. Now, the moment you've all been waiting for. Time to see tail slayer in action. Well, kind of sort of. It's like a simulation of action. But anyway, this is not really the benchmark code officially. This is how you might use tail slayer in real life inside of your code. If we take a look at our worker function, it still looks similar to our benchmark code, but we've stripped out a lot of the extraneous stuff, and we just have that kind of sandwiched center portion that's doing the read. The first thing we have to do is pin it to a core that's sitting there waiting, waiting, so it's ready to just fire off the read as soon as whatever data happens to be waiting for comes through. So, you're going to pin it to as many cores as you need to have duplicated data for. Don't forget, the number of duplicates equals how many channels you're going to be using. So, you have kind of this upper bound where use more channels than you have available on your system. Inside the code, I had to have some kind of unbiased timer. Normally, that would be a packet incoming or something, but I just used like — the number of cycles that have elapsed since we started it to signal off each independent read, which does work, but it's really stupid, so don't do this. But anyway, inside of here, we have our actual read portion of it, and then I had a some little bit of additional benchmark code left. But then, at the very end, we would do our processing, which I just did std:: cout, but again, don't do that. Process the packet, and then send it off. The point is that all of the processing logic is interleaved inside of this. So, it's kind of just happening like a spinning on the core, and then sending it out. That's my interpretation of it. Hehehe. — For the story that I promised, I tried to have just the most minimal amount of synchronization between the threads. So, I had some just one atomic variable that I was sharing, and then the main thread would know who got their data first, but the overhead of adding the atomic inside of my code just completely ruined the entire point of tail slayer. But, if you go through and you have all these independent copies on independent cores, then tail slayer works beautifully. As always, thank you so much for watching, everyone. I hope you enjoyed watching this video as much as I enjoyed making it and doing all of the research for this project. Don't forget to leave a comment below if you have any ideas on how this project could be extended or potentially used in more real-world scenarios. Till next time, we're wired out. You can shop at five stores or just one. — Oh, damn. That was the data fabric, sir.

Другие видео автора — LaurieWired

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник