# Rust Parallelism with Rayon - Use ALL CPUs

## Metadata

- **Channel:** Code to the Moon
- **YouTube:** https://www.youtube.com/watch?v=ZC6UWzX3Xug
- **Date:** 23.04.2026
- **Duration:** 13:21
- **Views:** 11,751
- **Source:** https://ekstraktznaniy.ru/video/51627

## Description

The fundamentals of using the Rayon crate to speed up your Rust application.
Check out PostHog here: https://go.posthog.com/cttm

Keyboard: Glove80 - https://bit.ly/3EKyn7X
Camera 1: Canon EOS R8 https://amzn.to/4gSpivt
Camera 2: Canon EOS R5 https://amzn.to/3CCrxzl
Monitor: Dell U4914DW 49in https://amzn.to/3MJV1jx
Microphone: Sennheiser 416 https://amzn.to/3Fkti60
Microphone Interface: Focusrite Clarett+ 2Pre https://amzn.to/3J5dy7S
Tripod: JOBY GorillaPod 5K https://amzn.to/3JaPxMA
Mouse: Razer DeathAdder Elite https://amzn.to/4tu57ul
Computer: Mac Studio M4 Max https://amzn.to/44RWIWK
Lens: Canon RF35mm F1.8 https://amzn.to/49XHWkT
Caffeine: High Brew Cold Brew Coffee https://amzn.to/3hXyx0q
More Caffeine: Monster Energy Juice, Pipeline Punch https://amzn.to/3Czmfox

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

This simple Rust program takes about 3.2 seconds to run. I'm going to speed it up by a factor of eight using two lines of code. So, up here we're going to add `use rayon::prelude::*`, and down here we're going to call `.into_par_iter()`. Give that a run. Boom, 459 milliseconds. A speedup of about eight. And this is using something called Rayon parallel iterators.

So, how many parallel buckets is this breaking my data into? Well, in this particular case, it's going to be 10. And that's not a magic number. 10 happens to be the number of logical CPU cores on this computer. This is an M1 MacBook Pro. By default, Rayon is going to create a thread pool with a number of threads equal to the number of logical CPU cores on the system. If you're familiar with Tokio, Tokio does exactly the same thing when it creates its thread pool. But I want to point out that Rayon is intended for blocking, synchronous, CPU-bound operations, as opposed to Tokio, which is optimized for asynchronous, non-blocking operations. So, you don't want to put something like a hash computation in a Tokio task. That's a good use case for Rayon.

Now, parallel iterators aren't the only way to use Rayon, but they are the simplest. You just replace your existing iterator with the Rayon equivalent. There is a function called `rayon::join` that allows you to parallelize more generic computations. We'll look at that in a little bit. We're going to look at an infamous Twitter post related to a notorious Google interview question, and we're going to look at how to parallelize that. So, stick around for that.

So, should we just go and replace all of our iterators with parallel iterators now? The answer is no. There are some tradeoffs here. For example, if I change the upper bound of this range to 10, let's see what happens. So, with parallel iterators, we get 456 microseconds. Remember that number. Take the parallel iterator out, and we get 36 microseconds without parallel iterators.
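The bucketing idea described above can be approximated by hand with the standard library. This is only a sketch of the concept, not Rayon's implementation: it uses `std::thread::scope`, which spawns OS threads instead of reusing a pool, and it swaps the random multiplier from the video for a fixed one so the result is reproducible. The names `bucketed_sum` and `workers` are made up for illustration.

```rust
use std::thread;

/// Sums `f(x)` over `data`, splitting the slice into roughly one bucket per worker.
/// Illustrative helper; this is not part of rayon's API.
fn bucketed_sum(data: &[u64], workers: usize, f: fn(u64) -> u64) -> u64 {
    let workers = workers.max(1);
    // Ceiling division so every element lands in exactly one bucket.
    let bucket_len = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // One scoped OS thread per bucket (rayon would reuse pooled threads).
        let handles: Vec<_> = data
            .chunks(bucket_len)
            .map(|bucket| s.spawn(move || bucket.iter().map(|&x| f(x)).sum::<u64>()))
            .collect();
        // Combine the per-bucket sums into a single total.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}

fn main() {
    // Logical core count, the same quantity rayon uses to size its default pool.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let data: Vec<u64> = (0..1_000_000).collect();
    let total = bucketed_sum(&data, workers, |x| x * 3);
    println!("{} buckets, total = {}", workers, total);
}
```

With Rayon itself, the whole helper collapses to the two-line change shown in the video: importing the prelude and calling `.into_par_iter()` on the range.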
So, in this particular case, the overhead of the parallel iterators is drowning out the benefits that we get from it. And the two things that typically make parallel iterators more valuable are number one, the data size, and number two, the nature of the computation that you're doing on the data. So, in this example, the data size is relatively large. We're walking through 10 million numbers. The computation we're doing is relatively straightforward. It's not super heavy. We're just multiplying the number by a random number between zero and 1,000. If this computation were something a little bit heavier, like a hash function or something like that, the data size at which parallel iterators would become valuable would be lower. It could be something like 100 or 1,000. So, those are the two main factors that make parallel iterators potentially more valuable. Something I wanted to point out that might not be immediately obvious is that this sum here is actually parallelized, too. So, we split our data into those 10 buckets, perform the map operation, and then we can actually perform a sum for that individual bucket, and then we can take all the sums from the buckets and sum those up together. So, this is a specific example of a reduce function. A reduce function is a function that takes two parameters of the same type and produces an output of that same type. Formulating an algorithm as a reduce function can make that algorithm really easy to parallelize. Quick note, there is a mutable version of parallel iterator. In this example, we have this small vector and we're multiplying each number by three. Just wanted to point out that there are mutable versions of the parallel iterators. So, that's parallel iterators, but operations on collections of data, filter, map, reduce, things like this, are not the only scenarios where parallelism is beneficial. Consider this simple program where we're actually spawning an operating system thread right here. 
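The program on screen is not included in the transcript, so the following is a reconstruction from the description: one sleep on a spawned OS thread, one sleep on the main thread, a join, and an elapsed-time check. The function name `two_sleeps` and the millisecond parameter are assumptions added so the timing can be varied.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Runs one sleep on a spawned OS thread and one on the calling thread,
/// then returns the total wall-clock time. Illustrative name and signature.
fn two_sleeps(millis: u64) -> Duration {
    let each = Duration::from_millis(millis);
    let start = Instant::now();
    // Stand-in for blocking, potentially CPU-bound work on a new OS thread.
    let handle = thread::spawn(move || thread::sleep(each));
    // The same amount of "work" on the current thread.
    thread::sleep(each);
    // Block the current thread until the spawned thread has finished.
    handle.join().expect("spawned thread panicked");
    start.elapsed()
}

fn main() {
    // Two sleeps of two seconds each, as in the video.
    let elapsed = two_sleeps(2_000);
    println!("total runtime: {:?}", elapsed);
}
```

If the two sleeps overlap, the reported runtime is close to the length of a single sleep rather than the sum of both.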
Spawning operating system threads and destroying them is always best avoided if possible. There's a lot of overhead involved in spinning up threads and destroying them. That is the whole premise of crates like Tokio and Rayon. But in this program, we just have this CPU-bound computation. It's actually just a sleep, but from the computer's perspective, it's a blocking operation, potentially CPU-bound. We're spawning a thread that's working on that sleep, and then we're running another sleep in the main thread. Then, down here, we're joining that spawned thread. So, that's going to block the current thread until the spawned thread is complete, and then we're checking the elapsed time, right? So, we're doing two sleeps of two seconds. The total runtime of the program should be two seconds. Sure enough, it is. That proves that both of these sleeps are happening in parallel.

So, what is the Rayon equivalent of something like this? We don't have any collections in play. There's no opportunity to use iterators. The Rayon version looks like this. We use `rayon::join`. And `rayon::join` is great for situations where you want to parallelize computation, but you might not necessarily have a collection of data, or you might have some

### Segment 2 (05:00 - 10:00) [5:00]

computations that depend on the results of other computations, but there are some computations that are independent, and so you have kind of a tree of computations. We'll see more of that in a second. Both parameters that we pass to `rayon::join` are synchronous closures. Rayon does not deal with asynchronous functions like Tokio does, so that might be a relief to some people. Two synchronous closures. The first one is a closure that will be run immediately in most cases, not all, and in most cases on the current thread. So, that operation is going to be kicked off immediately. The second parameter is an operation that's going to be put on the current thread's queue. So, there's a potential that the second operation will be run on the current thread, but if there are other threads available to do work, sitting there idle, they could steal that work from our current thread's queue and start working on that operation. So, effectively, these become parallelized as long as there's a thread free to operate on that second operation. Let's go ahead and run this, and we see the total runtime of the program is two seconds, as we expect. So, how is this better than the previous example where we were calling `thread::spawn`? It's ideal to use threads in an existing thread pool instead of creating and destroying operating system threads. Much less overhead.

Now, let's look at this famous tweet from a while ago. If you're on a Mac, you're probably familiar with Homebrew. Well, the creator of Homebrew was very upset that Google asked him to invert a binary tree on a whiteboard. So, let's invert a binary tree. It might not immediately strike you as an algorithm that's ripe for parallelization, but it turns out that it can be.

Now, in this video, we've been talking about performance optimizations, but here's the hard truth. Performance is not going to matter if people don't like your product. That's why I'm very excited to say that this video is sponsored by PostHog.
Folks, they have a Rust SDK. They get it. PostHog is a product analytics platform that gives you user session replay, feature flags, experiments, logging, metrics, and more, all in one stack. It is really hard to think of a business that doesn't need these things. They make a concerted effort to be lower cost than competitors. That's a pretty big deal, because the last thing you want is to forgo valuable metrics and usage data just to get your monthly bill under control. I mentioned it includes logs, so you can see what your back end and browser are doing and tie that directly to user behavior. PostHog AI can help you glean actionable insights from user data. It can summarize user sessions and surface patterns automatically. Together, that lets you understand user behavior, so you can quickly zero in on their pain points and make your products a joy to use. For engineers who want one stack for analytics, logs, and AI-assisted analysis, PostHog is a strong, strong default choice. Thank you again to PostHog for sponsoring this video.

So, this is what it looks like to invert a binary tree in Rust. The function is going to take a mutable reference to an option of a box of a tree node. A tree node is just a struct with three fields: data, left child, right child. The left and right children can be some or none, depending on whether that node has children. Very, very simple algorithm. We're first going to check if the tree node is some or none. If it's none, we're not going to do anything. There's nothing to invert, right? If it's some, we're first going to invert the tree rooted at the left child. Just for context, we're generating a full tree. I think it's called a perfect tree, where every node has children except the leaf nodes, which have zero children. And we're going to create a perfect tree with depth 23. So, this is going to be a fairly large tree. It's going to have over 8 million nodes.
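The on-screen code is not part of the transcript, so here is a minimal sketch of the serial inversion described above. The struct layout and the `&mut Option<Box<TreeNode>>` parameter follow the description; the field types, the `perfect_tree` helper, and the steps after the left-child recursion (recursing into the right child, then swapping the two) are assumptions based on the usual formulation of the algorithm.

```rust
use std::mem;

/// A node with three fields: data, left child, right child.
#[derive(Debug, Clone, PartialEq)]
struct TreeNode {
    data: i32,
    left: Option<Box<TreeNode>>,
    right: Option<Box<TreeNode>>,
}

/// Builds a perfect tree with `depth` levels, numbering nodes in pre-order.
/// Hypothetical helper; the video's generator is not shown in the transcript.
fn perfect_tree(depth: u32, next_id: &mut i32) -> Option<Box<TreeNode>> {
    if depth == 0 {
        return None;
    }
    let data = *next_id;
    *next_id += 1;
    let left = perfect_tree(depth - 1, next_id);
    let right = perfect_tree(depth - 1, next_id);
    Some(Box::new(TreeNode { data, left, right }))
}

/// Inverts the tree in place: every node's children trade places.
fn invert_tree(node: &mut Option<Box<TreeNode>>) {
    // None means there is nothing to invert.
    if let Some(n) = node.as_deref_mut() {
        invert_tree(&mut n.left);
        invert_tree(&mut n.right);
        mem::swap(&mut n.left, &mut n.right);
    }
}

fn main() {
    // The video uses depth 23 (over 8 million nodes); 16 keeps this sketch light.
    let mut next_id = 0;
    let mut root = perfect_tree(16, &mut next_id);
    invert_tree(&mut root);
    println!("inverted a tree with {} nodes", next_id);
}
```

Inverting the same tree twice should give back the original, which is a convenient sanity check for any variant of this function.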
The pattern that we're seeing here is a tree of computations, where one computation might depend on the result of a computation lower in the tree, but there's still some work that you can do in parallel. I think that's relatively common in general. So, let's run this and see how long it takes to invert that tree. 72 milliseconds. Okay, remember that number.

So, what does it look like to parallelize this? Well, we can use `rayon::join`. We have those two recursive calls to invert tree. We can just put those straight into the parameters of `rayon::join`. So, the first closure is going to run immediately, and the second will be queued, and it might be picked up by any of the Rayon worker threads to start work on that. So, these are effectively going to be parallelized, right? Pretty straightforward. Let's run this and see how long it takes to run. 281 milliseconds. I think the non-parallelized version was something like 70 milliseconds. Just check that to make sure. Yeah, about 70 milliseconds. So, what's going on here? Well, `rayon::join` has to queue up all of these parallel operations. And this tree, again, is going to have over 8 million nodes. This is a huge, huge tree. Unfortunately, there's a lot of overhead in just queuing those operations. Just making a queue of millions of nodes is going to incur a lot of overhead. On top of that, the Rayon threads have to use data

### Segment 3 (10:00 - 13:00) [10:00]

structures that let them access each other's queues, because they steal work from each other. Doing that probably requires some atomics and some mutexes, things like this. One common solution to this is just to be selective about what you parallelize. One way to approach this in an algorithm like this is to come up with a depth threshold where once you go beyond that depth, you do everything in serial. You don't parallelize anymore, but below that depth, you parallelize everything. So, this is what that looks like. We've added a depth parameter to the function and we check the depth parameter. If it's less than six, we call `rayon::join`, passing invert tree to each of the parameters like we were before. Otherwise, we call invert tree in serial like we were before we added Rayon into the picture. So, let's see what the runtime of this is. 15 milliseconds. That is a substantial speedup. What was the runtime of the original one? 70 milliseconds. Okay, so a pretty substantial speedup. Again, inverting binary trees is not the most practical algorithm, but this pattern, how to approach algorithms that involve this kind of hierarchy of computations, is very applicable.

Now, we talked about Tokio and how Tokio and Rayon are meant for very different use cases. What would it look like if we used Tokio for this use case? Well, it turns out it's possible. I'm not going to go too much into the code, so don't worry about the code. It turns out `tokio::spawn` has this limitation where the future you pass it has to have a static lifetime, and so that makes things a little more challenging. We can't have the function accept a mutable reference as a parameter anymore. We have to take ownership of the parameter and then we have to return a tree node. It gets a little more complicated, but let's quickly see what the runtime of this is. Let's see, we did depth six, right? Okay, 188 milliseconds. So, something like two and a half times as long as the single-threaded version.
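Returning to the depth-threshold version of the tree inversion described earlier in this segment, the sketch below shows the shape of the cutoff. It is a stand-in under stated assumptions: `std::thread::scope` replaces `rayon::join` so the example compiles without external crates, which means it spawns OS threads near the root instead of using Rayon's pool and will not reproduce the video's timings. `PARALLEL_DEPTH` and `perfect_tree` are illustrative names.

```rust
use std::mem;
use std::thread;

#[derive(Debug, Clone, PartialEq)]
struct TreeNode {
    data: i32,
    left: Option<Box<TreeNode>>,
    right: Option<Box<TreeNode>>,
}

/// Hypothetical helper: perfect tree with `depth` levels, numbered in pre-order.
fn perfect_tree(depth: u32, next_id: &mut i32) -> Option<Box<TreeNode>> {
    if depth == 0 {
        return None;
    }
    let data = *next_id;
    *next_id += 1;
    let left = perfect_tree(depth - 1, next_id);
    let right = perfect_tree(depth - 1, next_id);
    Some(Box::new(TreeNode { data, left, right }))
}

/// Levels shallower than this are handled in parallel; 6 is the value from the video.
const PARALLEL_DEPTH: u32 = 6;

fn invert_tree(node: &mut Option<Box<TreeNode>>, depth: u32) {
    if let Some(n) = node.as_deref_mut() {
        // Split the node into two disjoint mutable borrows.
        let TreeNode { left, right, .. } = n;
        if depth < PARALLEL_DEPTH {
            // Near the root: run the two subtrees concurrently.
            // With rayon, these two calls would be the two closures given to rayon::join.
            thread::scope(|s| {
                s.spawn(|| invert_tree(left, depth + 1));
                invert_tree(right, depth + 1);
            });
        } else {
            // Deep in the tree: plain serial recursion, no scheduling overhead.
            invert_tree(left, depth + 1);
            invert_tree(right, depth + 1);
        }
        mem::swap(left, right);
    }
}

fn main() {
    let mut next_id = 0;
    let mut root = perfect_tree(16, &mut next_id);
    invert_tree(&mut root, 0);
    println!("inverted a tree with {} nodes", next_id);
}
```

The design point is that parallel work is only created for the first few levels, where each subtree is large enough to justify the scheduling cost; everything deeper runs as ordinary recursion.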
So, `tokio::spawn` is not going to help you here. There's a lot of overhead in creating Tokio tasks, and it is not meant for CPU-bound operations. In case you're wondering, is Rayon really better for these use cases? Yes, it is.

Now, `rayon::join` only accepts two closures, and that's useful for scenarios where you're branching in two directions on each level of the computation, but there are some situations where you might want to create many Rayon tasks at the same time. And for that, you can use `rayon::scope`. So, `rayon::scope` takes a closure that gives you a scope, and you can call the spawn method on that scope as many times as you want. So, in this case, we're creating three Rayon tasks, each of which is performing an operation on this integer. So, this can be handy in some situations.

Folks, I hope this has given you a better understanding of the Rayon crate, when to use it, when it can be useful, when to not use it, and a good understanding of the differences between Rayon and Tokio. If you are interested in Tokio and you haven't seen this video yet, definitely check out my Tokio video. I try to break down a lot of the misconceptions around Tokio, and again give you a good sense of when to use it, when not to use it, things like that. Definitely check out that video if you haven't already. Thank you again to PostHog for sponsoring this video. Thanks for watching, and we'll see you in the next one.
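For the scope pattern mentioned above, where one scope spawns several tasks, the standard library offers a similarly shaped API, used here as a dependency-free stand-in. The transcript does not show what each task does to the integer, so the three additions and the `AtomicI32` are assumptions for illustration; `three_tasks` is a made-up name.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;

/// Spawns three tasks inside one scope, each updating a shared integer.
/// Illustrative only; the operations are not taken from the video.
fn three_tasks(start: i32) -> i32 {
    let value = AtomicI32::new(start);
    // The scope returns only after every task spawned inside it has finished.
    // rayon::scope has a similar shape, except its spawn closures receive the
    // scope as an argument (`s.spawn(|_| ...)`) and run on pool threads.
    thread::scope(|s| {
        s.spawn(|| {
            value.fetch_add(1, Ordering::Relaxed);
        });
        s.spawn(|| {
            value.fetch_add(10, Ordering::Relaxed);
        });
        s.spawn(|| {
            value.fetch_add(100, Ordering::Relaxed);
        });
    });
    value.into_inner()
}

fn main() {
    println!("result: {}", three_tasks(0));
}
```

Because the scope waits for all of its tasks, the closures can borrow `value` from the enclosing function without needing a `'static` lifetime.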