# The Engineering that Runs the Digital World 🛠️⚙️💻 How do CPUs Work?

## Metadata

- **Channel:** Branch Education
- **YouTube:** https://www.youtube.com/watch?v=16zrEPOsIcI

## Contents

### [0:00](https://www.youtube.com/watch?v=16zrEPOsIcI) Segment 1 (00:00 - 05:00)

Inside every desktop computer, smartphone, gaming console, laptop, or practically any other device you use on a daily basis is a CPU or Central Processing Unit, and in this video, we’re going to see how they work. A typical processor for a powerful laptop like this one is built from billions of nanoscopic transistors connected together using dozens of layers of wires and is essentially the brain of the device.

But before we explore the microprocessor and all its complexity, let’s travel to the early days of personal computers and video game consoles and compare the Apple IIe from 1983 to the modern-day MacBook Pro. Inside the Apple IIe we find a chip called the 6502, which is considered one of the great-grandparents of all modern processors. This chip is built from 4528 transistors and can perform around 430 thousand calculations a second. While it could only run primitive applications and video games with simple graphics, this chip was the backbone of a generation of early computers and video game consoles such as the NES, the Commodore 64, and the Atari.

Compare that to the MacBook Pro’s M1 processor, which is built from 16 billion transistors capable of performing around 3 trillion calculations a second, thus enabling it to generate expansive 3D worlds with immersive graphics. Despite these two chips being released around 45 years apart, the underlying principles of how they work are rather similar. In a way, you can think of these devices as sharing a common section of technological DNA. In fact, if we were to open up a desktop computer and grab the CPU or the GPU inside the graphics card, or tear down a Nintendo Switch or smartphone and find the system on a chip or SoC, or even if you could make your way into an AI data center and grab a state-of-the-art AI chip, you’d find that all of these processors operate using the same underlying principles.
In other words, both the Oregon Trail of 50 years ago and advanced AI algorithms run on processors with similar technological DNA, but of course, one of these chips is 10 billion times more computationally powerful than the other. So, in this video, we’re going to take apart this microprocessor and find out exactly what the shared technological DNA is and how it enables CPUs to work. And just to be clear, the technological DNA is not transistors and it’s not logic gates, but rather it’s an architectural design and basic operational principle that’s fundamental to microprocessors and differentiates these chips from other integrated circuits. So, stick around, and let’s dive right in. This video is sponsored by Brilliant.

Let’s begin with a quick 3D animated teardown of this MacBook Pro. When we open it up, we find a range of different components such as the touchpad, battery cells, speakers, a cooling fan, and the motherboard in the center. Mounted to the motherboard are the solid state drive or SSD storage chips where all your files are saved and a range of other chips. Underneath the heat pipe, we find the DRAM, which is the short-term working memory, and the central processing unit. Let’s desolder the DRAM and CPU and open the CPU up.

Inside we find three parts: on the top is a protective cover that conducts and dissipates heat; on the bottom is an interposer with thousands of connection points on either side and wires running inside of it; and soldered onto the interposer is the integrated circuit or IC, which is also called a die and is the functional part of the CPU. On the die we can see the complex design of billions of transistors and wires organized into different sections such as the 4 high-performance computational cores, 4 energy-efficient cores, graphics processing cores, cache memory, and many other sections.
Let’s zoom in on one of the performance cores where we find that it’s separated into different functional blocks which we’ll add labels to and then reorganize into an architectural

### [5:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=300s) Segment 2 (05:00 - 10:00)

diagram. This diagram illustrates how data and instructions move around a single processing core in the CPU and, although it’s rather complicated, you’ll understand how it works by the end of this video. But for now, let’s zoom in even further to get a nanoscopic view of a massive multilayer labyrinth of wires with the transistors at the very bottom. Here we see a group of 6 transistors that are wired together to build an AND logic gate, and in this view we can see around 650 transistors out of the total 16 billion that make up the overall chip.

Understanding how billions of transistors work together to build a CPU capable of playing video games, watching movies or browsing the internet will take a bit of work, so let’s start with an analogy. You may have heard that CPUs are like super powerful calculators. This analogy is only around 20% complete as it’s missing some critical parts, so let’s add them in to make a more accurate analogy. First, we’ll add a table for the calculator to sit on, along with a pencil and a sheet of paper. Next, we’ll add rows and rows of bookshelves containing thousands of books along with a cart that can carry a small stack of books between the shelves and the table. And finally, we’ll add an automated robot which we’ll call a control unit or controller. The controller can grab books from the bookshelves, move them to the cart and onto the table, and it can put them back. The controller can also read the contents of each book, write on the paper and in the books, and use the calculator. You can think of the controller as a super-fast human, but we’re bad at animating humans, so it’s a robot instead. And with that, we have all the parts for our analogy.

Now let’s see how each part of our CPU analogy works. To start, the bookshelves are the storage devices in your computer, such as the SSD chips, the cart represents the DRAM, and the table and its contents represent the CPU.
On the CPU table, there’s a small space for a single open book, which is similar to the very limited capacity of the cache memory inside the CPU itself. Next, the single sheet of paper represents the registers, which are used for storing values or numbers that are actively being used. Specifically, on it are four general-purpose registers and a few more special locations which we’ll discuss in a little bit. Additionally, the pencil is there to write and erase things on the paper and in the books.

Finally, the calculator represents the Arithmetic Logic Unit or ALU. This ALU calculator works using binary, so there are only the digits zero and one, and it can do simple functions like add, subtract or multiply two numbers. The ALU calculator has many more functions that you may be unfamiliar with but are still rather simple. For example, it can increase or decrease a number by one, or it can perform bit shifts, which is essentially taking a number and adding a zero to the end of it. In decimal, bit shifting is like multiplying a number by 10, but in binary it’s equivalent to multiplying a number by two. The ALU can also perform logic functions on two numbers such as the logical AND, OR, or Exclusive OR operations. For example, here is the logical AND operation for 2 binary inputs, and you can see that the output is the logical AND for each place value of the 2 inputs.

However, more importantly, the ALU calculator can perform comparisons. For example, you can input two numbers and hit the comparison button to test whether the numbers are equal to one another and, if they are, then an equals flag goes up while the other comparison flags like less than or greater than stay down. Finally, the ALU calculator’s display that outputs the result has a special name: the accumulator. So now that we’ve explained the various parts of this analogy, how does it all work together?
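The ALU functions described above can be sketched in a few lines of Python. This is only an illustrative model, not real circuitry; the 8-bit mask is an assumption borrowed from the 6502 era, and the function names are made up for readability.

```python
# A minimal sketch of the ALU operations described above (8-bit values).
MASK = 0xFF  # keep results within 8 bits, like a 6502-era ALU (assumption)

def alu_add(a, b):
    # Addition wraps around when it overflows 8 bits.
    return (a + b) & MASK

def alu_shift_left(a):
    # A left bit shift appends a zero: in binary this doubles the number,
    # just like appending a zero multiplies by 10 in decimal.
    return (a << 1) & MASK

def alu_and(a, b):
    # Logical AND of each place value of the two inputs.
    return a & b

def alu_compare(a, b):
    # Comparisons don't produce a number; they raise flags instead.
    return {"equal": a == b, "less": a < b, "greater": a > b}

print(bin(alu_and(0b1100, 0b1010)))   # -> 0b1000
print(alu_shift_left(3))              # -> 6 (multiply by two)
print(alu_compare(5, 5)["equal"])     # -> True (the equals flag goes up)
```

Note how the comparison returns flags rather than a value; those flags are what the conditional branch instructions discussed later will read.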
Well, the first step is to load a program that we want to run, which is like moving a set of books from the bookshelves to the cart, and then moving a single book to the table and opening it up. It’s

### [10:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=600s) Segment 3 (10:00 - 15:00)

important to note that the DRAM cart and cache memory on the table are both temporary and limited-capacity locations, whereas the SSD bookshelves can hold a lot more and are semi-permanent long-term storage. Additionally, when the computer is turned off, there are no books in the DRAM or the CPU, but when the computer turns on, the cart and table are very actively shuffling books around.

Let’s take a look at the contents of one of the books. Essentially, there are two types of pages: instructions and data. You can think of the instructions as the directions in a cookbook, with each step numbered sequentially across the pages. And the pages of data contain a list of addresses with values stored at each address and are like the ingredients that go into the recipe itself. Similar to cooking, you need both the recipe and ingredients to make it work, and just a few ingredients can be combined in dozens of different ways using different recipes. But to not use an analogy inside another analogy, let’s drop the cooking one and focus on the books, table and calculator.

Let’s start at the beginning of this program and flip to page one, instruction one, which is called a ‘load’ and is the most common type of instruction. This ‘load’ has us open the pages of data and find a specific address. We then copy the value stored at that address and write it down into one of the general-purpose registers on the sheet of paper. With instruction one completed, we move to instruction two, which is to increment the value in register zero by 1. So, we plug the value into the ALU calculator and hit plus 1. The third instruction, called a ‘store’ instruction, is used to save the output of the calculator found in the accumulator display into the pages of data at the same address it was in before. These simple yet very common instructions are equivalent to this line of code.
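The load → increment → store sequence just described amounts to a single line of code like `x = x + 1`. Here is a hedged sketch of it, with the pages of data modeled as a Python dictionary of addresses; the address `0x10` and the starting value are invented for illustration.

```python
# Illustrative sketch of the load / increment / store sequence, which
# together implement the single line of code `x = x + 1`.
memory = {0x10: 41}       # pages of data: address 0x10 holds the value 41
registers = [0, 0, 0, 0]  # four general-purpose registers on the paper

# Instruction 1: LOAD -- copy the value at address 0x10 into register 0.
registers[0] = memory[0x10]

# Instruction 2: INCREMENT -- the ALU adds 1; the result lands in the
# accumulator (the calculator's display).
accumulator = registers[0] + 1

# Instruction 3: STORE -- write the accumulator back to the same address.
memory[0x10] = accumulator

print(memory[0x10])  # -> 42
```

Three separate instructions for one line of code is typical: the CPU can only compute on values once they have been loaded into registers.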
Next, we move on to instruction 4 and complete it, and then instruction 5 and so on, moving through the list of instructions which goes on and on. In order to keep track of which instruction is the next one to be completed, we use one of the special locations on the sheet of paper that we mentioned earlier called the Program Counter or PC, also called an Instruction Address Register or Instruction Pointer. Since the PC currently has a value of 5, we find instruction 5, complete it, and increase the program counter by one. Therefore, the next instruction to be completed will be instruction 6. However, what if after completing instruction 6, we want to jump directly to instruction 42? Well, to do this we use a jump instruction at 7, which directly sets the value in the program counter to a new number, in this case 42. As a result, the sequence of instructions will be 5, 6, 7 which is the jump instruction, then 42, 43, 44 and so on.

A similar set of instructions is called a conditional branch, which is used for implementing IF statements, loops, and other conditional code. Let’s use an example of a FOR loop with a few simple lines of code inside of it. Quite simply, this loop is used to repeat the code inside of it 4 times. Here are the corresponding instructions of the FOR loop along with the instructions for the code inside of it, and we color-coded each of the elements to keep track of which specific lines of code result in the corresponding instructions. We’ll discuss how compilers turn code into instructions later in this video, but for now let’s focus on the FOR loop and its instructions. Specifically, here’s where ‘i’ gets set to 0 and stored in an address in the pages of data, here’s where ‘i’ is loaded from that address and incremented by one, and here’s the contents of the loop. At the top, you can see the three instructions: Load, Compare, and Branch Greater Than or Equal To. The Load first grabs the value for ‘i’ and places it into register 0.
Compare feeds ‘i’ stored in register 0 and a value of 4 into the ALU and compares them, resulting in the applicable comparison flags being triggered. Next, branch greater than or equal to checks whether

### [15:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=900s) Segment 4 (15:00 - 20:00)

either the greater-than or the equals flag is on, and if it is, it sets the program counter to 23, which corresponds to completing and leaving the loop. However, when ‘i’ is less than 4, those flags aren’t triggered and the loop continues, until it hits the jump instruction at address 22, where the jump sets the program counter to 6, which corresponds to the top of the FOR loop. As a result, the loop will repeat a total of 4 times. Note that this code on the left is in C++, whereas the actual instructions completed by your CPU are in a binary language called machine code, and the semi-readable version of the instructions is called assembly, but we modified this assembly a little bit to make it more readable.

One interesting note is that you may think that, with everything a computer can do, there must be tens of thousands of different instructions. Well, actually, the 6502 processor in the Apple IIe from 1983 could only complete 56 different instructions, whereas the modern M1 chip in the MacBook Pro can complete 354 instructions. Here’s the list of all the instructions each chip can execute, and if you take a good look, most of these instructions are rather simple. Let’s just think about that for a second. Every single thing you do on your computer can be constructed using only various sequences of 354 different instructions. However, many programs have millions upon millions of lines of instructions, and hopefully, there aren’t any bugs in them.

So now that we’ve discussed the range of possible instructions, let’s further explore how CPUs work. In order to complete an instruction there are always three key steps: Fetch, Decode, and Execute. The first step is Fetch, where the controller uses the value in the program counter to search through the pages of instructions in the book for the corresponding instruction address. The controller then copies the instruction found at that address into a special location called the current instruction register or CIR.
At the same time, the controller increases the program counter by 1. The second step is Decode, and in this step the current instruction is fed into a circuit called the instruction decoder. In our analogy from earlier, this decoder is a key part of the controller, and in essence it’s the circuitry that reads in an instruction, interprets what the machine code of that instruction actually does, and simultaneously produces the control signals to properly execute it. Specifically, this instruction decoder circuit uses the binary values of the instruction and an incredibly complex arrangement of logic gates to produce the corresponding control signals, which are then sent to the different elements in the CPU.

Instruction decoders are one of the more complicated parts of the CPU, but here’s an example along with a simplified explanation. Let’s say we have this ADD instruction in the current instruction register or CIR and it’s fed into the instruction decoder. The first part of the binary instruction specifies that we want to use the ALU. With the ALU selected, the next 3 bits specify that we want to use the ADD function, and then the last 4 bits of the instruction indicate that we want the values in register 0 and register 1 to be routed and sent to the ALU. Instruction decoding is considerably more complicated than that, but let’s move on to the third step, which is Execute.

During Execute, using our example instruction, the control signals from the instruction decoder and an intricate set of electrical timing signals are used to first send the value in register 0 and then 1 to the ALU. The timing signals are used to accommodate the time it takes electricity to travel from the registers to the ALU and for transistors and logic gates to change their state, thereby ensuring the correct result at the output.
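The decoding example above can be sketched as bit-field extraction. The 8-bit encoding here is invented purely for illustration (real instruction sets are far more intricate): one top bit selects the ALU, the next three bits pick the function, and the last four bits name the two source registers, two bits each.

```python
# A hedged sketch of instruction decoding with a made-up 8-bit encoding:
#   bit 7      -> 1 means "use the ALU"
#   bits 6..4  -> which ALU function
#   bits 3..2  -> first source register
#   bits 1..0  -> second source register
ALU_FUNCTIONS = {0b000: "ADD", 0b001: "SUB", 0b010: "AND", 0b011: "OR"}

def decode(instruction):
    use_alu = (instruction >> 7) & 0b1
    function = (instruction >> 4) & 0b111
    src_a = (instruction >> 2) & 0b11
    src_b = instruction & 0b11
    return {
        "unit": "ALU" if use_alu else "other",
        "function": ALU_FUNCTIONS[function],
        "sources": (f"r{src_a}", f"r{src_b}"),
    }

# 1 000 00 01 -> use the ALU, ADD function, route r0 and r1 to its inputs
print(decode(0b10000001))
```

In hardware this unpacking is done by logic gates driving control signal wires rather than by returning a dictionary, but the idea is the same: the bits of the instruction directly select which units act and which registers feed them.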
After the values are input, the ALU adds the two numbers together, and a subsequent timing signal saves the result into the accumulator, thus completing the Execute step. These three steps, Fetch, Decode, and Execute, are used to complete a single instruction

### [20:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=1200s) Segment 5 (20:00 - 25:00)

and once it’s completed, these steps repeat using the new value in the program counter. In essence, Fetch, Decode, and Execute form a cycle, so let’s run through it again. During Fetch, the controller uses the program counter’s value to fetch the corresponding instruction and places it in the CIR, and the program counter increases by 1. Next, during Decode, the instruction’s binary is fed into the instruction decoder, where a complex set of logic gates generates the correct electrical control signals for that instruction. Finally, during Execute, the instruction is completed using the control signals and timing signals, and in this case, the value in the accumulator is stored back into a memory address which, using the analogy, is like writing the value from the calculator display into a data location in the book. Then the Fetch, Decode, Execute cycle repeats again using the next program counter value and so on.

The Fetch, Decode, Execute cycle is used in every processor, no matter whether it’s the 6502 in the Apple IIe or the M1 in the MacBook Pro. But of course there are many differences, such as the size of the cache, the registers, or the functions on the ALU calculator and much more. We’ll explore the exact differences in a few minutes, but for now, one important detail is that the Fetch, Decode, Execute cycle uses your computer’s clock to regulate its pace. The 6502 chip had a one-megahertz clock which ticked away a million times a second, and thus each step in the Fetch, Decode, Execute cycle took a microsecond. Additionally, the 6502 was an 8-bit processor, meaning the size of the registers and the ALU’s input and output were 8 bits wide. On the other hand, the M1 chip is a 64-bit processor, so it can handle much larger numbers, and it uses a 3.2 Gigahertz clock and therefore each step takes a third of a nanosecond.
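The cycle just recapped can be sketched as a tiny interpreter running the FOR-loop program from earlier: each pass fetches the instruction at the program counter (incrementing it as it goes), decodes it, and executes it, with the branch and jump instructions overwriting the program counter. The opcode names and instruction addresses are made up for illustration; a real CPU does all of this with control signals, not Python.

```python
# Illustrative fetch-decode-execute loop running the FOR loop from earlier.
memory = {"i": None}   # the pages of data
r0 = 0                 # general-purpose register 0
flags = {"equal": False, "greater": False}
body_runs = 0          # counts how often the loop body executes

program = [
    ("STORE_ZERO", "i"),  # 0: i = 0
    ("LOAD", "i"),        # 1: r0 = i          <- top of the loop
    ("COMPARE", 4),       # 2: compare r0 with 4, raising flags
    ("BGE", 7),           # 3: if r0 >= 4, leave the loop
    ("BODY", None),       # 4: the code inside the loop
    ("INCREMENT", "i"),   # 5: i = r0 + 1
    ("JUMP", 1),          # 6: back to the top of the loop
    ("HALT", None),       # 7: done
]

pc = 0
while True:
    # Fetch: copy the instruction at the PC into the current instruction
    # register, and increase the PC by 1 at the same time.
    cir = program[pc]
    pc += 1
    # Decode: unpack the instruction (real hardware uses logic gates to
    # turn the instruction's bits into control signals).
    op, arg = cir
    # Execute: carry out the operation the control signals selected.
    if op == "STORE_ZERO":
        memory[arg] = 0
    elif op == "LOAD":
        r0 = memory[arg]
    elif op == "COMPARE":
        flags["equal"], flags["greater"] = r0 == arg, r0 > arg
    elif op == "BGE" and (flags["equal"] or flags["greater"]):
        pc = arg          # branch: overwrite the program counter
    elif op == "BODY":
        body_runs += 1
    elif op == "INCREMENT":
        memory[arg] = r0 + 1
    elif op == "JUMP":
        pc = arg
    elif op == "HALT":
        break

print(body_runs)  # the body ran 4 times
```

With `i` starting at 0, the branch at address 3 only fires once `i` reaches 4, so the body runs exactly four times, mirroring the instruction sequence walked through above.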
Additionally, the M1 chip, along with all modern chips, uses a technique called pipelining, where multiple instructions are queued up, resulting in fetch, decode, and execute for different program counter values and different instructions being completed at the same time. There are many other optimizations in modern processors that we’ll soon discuss, but it’s important to understand that from the second you turn on your laptop, smartphone, gaming console, GPU or AI server, to the second you shut it off, the processor is continuously cycling through Fetch, Decode, Execute over and over, using programs filled with instructions and data along with the CPU’s clock to regulate its pace. In essence, this Fetch, Decode, Execute cycle is the common section of technological DNA that has powered every single processor built over the past 50 years. This cycle of steps is incredibly powerful, capable of performing trillions to quadrillions of mathematical operations every second in a single chip.

But you may be wondering, are there alternatives to the Fetch, Decode, Execute cycle? Well, there’s a world of different kinds of microchips, but specifically, alternatives include Application-Specific Integrated Circuits or ASICs, such as these microchips found in this bitcoin mining computer, or Field Programmable Gate Arrays or FPGAs, which are the main chips in a number of automotive computers and cameras. Both ASICs and FPGAs skip the fetch and decode steps; rather, they perform repetitive operations by flowing data through a set pattern of logic gates and execution units, making them highly optimized, but very inflexible. And then even further from these chips are quantum computers, which are based on qubits and quantum circuits and which we’ll discuss in future videos. But, now that we’ve covered the Fetch, Decode, Execute cycle, it’s important to discuss two more steps: Memory and Writeback.
Memory is analogous to moving books from the SSD bookshelves onto the DRAM cart and then onto the table or Cache Memory, and Writeback is like writing data into the books, and when space for a new book is needed on the table, the old book is placed back

### [25:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=1500s) Segment 6 (25:00 - 30:00)

on the DRAM cart and eventually returned to the bookshelves. These two steps use another special location called the memory address register and are critical to a functioning computer, but they typically take a lot longer to complete than the Fetch, Decode, Execute steps; therefore, in some architectures and textbooks they’re included in the cycle and in others they aren’t. We’re working on a separate video on how data moves around these memory locations, so stay tuned. Now that we’ve uncovered the technological DNA inside all processors, it’s important to note that, just as there are multiple layers of biological organization and structure between the DNA found in the nucleus of a cell and a complete living thing, there are many layers of complexity or abstraction between the Fetch, Decode, Execute cycle and a computer running a video game or browsing the internet.

If you want to dive into some of the other layers and understand more about how computers work, we recommend you check out Brilliant, the sponsor of this video. Brilliant has a massive library of interactive courses that include subjects like calculus, scientific thinking, circuits, programming in Python, logic, data analysis, and many more topics that would take far too long to list. However, Brilliant is much more than a list of courses; rather, it’s as if your favorite teacher who makes classes engaging were combined with your favorite video game and then mixed with the knowledge from countless textbooks. The result would be Brilliant. Their mission is to create a world of better problem solvers, and every one of their courses focuses on critical thinking through interactive games and lessons. Furthermore, with technology progressing faster than ever, Brilliant continuously updates their lessons to anticipate what you need to know for your education and career.
For example, they have a new course on AI and Large Language Models that explains how Generative AI works far better than any other textbook or video out there. Develop your knowledge by learning a little every day. You can start today by signing up for free using the link Brilliant.org/BranchEducation, or by scanning the QR code on screen, and you’ll then have access to the wide range of courses throughout their catalog. If you enjoy their content and decide to stay, the link in the description below will also save you 20% off an annual premium subscription, which will give you unlimited daily access to everything on Brilliant.

Ok, so let’s quickly run through slightly more advanced topics to finish up this video. Earlier we mentioned that the MacBook Pro’s M1 chip can complete 354 different instructions. This set of instructions is called ArmV8.4 and it’s categorized as a RISC architecture or Reduced Instruction Set Computer. For example, here’s a simple game of Snake using 145 lines of C++ code. It’s the job of a compiler, which is a separate piece of software, to take this code along with ArmV8.4’s 354 RISC instructions and generate a list of 676 assembly instructions equivalent to the machine code instructions that would be found in a book or program named snake.app. The other common architecture, found in Intel and AMD chips, is called CISC, or Complex Instruction Set Computer, and is composed of thousands of different possible instructions. For example, here’s the equivalent Snake program compiled to run on an Intel or AMD chip using CISC and x86-64 instructions, and you can see it’s only 560 instructions now. A few key differences between RISC and CISC are that each RISC instruction is relatively simple and is executed at a consistently fast execution rate.
Additionally, RISC architectures are more energy efficient and thus used in all smartphones, whereas CISC architectures have thousands of different instructions and pack a lot more into a single instruction. Additionally, the CISC instruction decoder is much more complicated, and individual instructions have a variable execution rate sometimes taking multiple clock cycles to execute. There are many additional pros and cons to RISC vs CISC which we’ll save

### [30:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=1800s) Segment 7 (30:00 - 35:00)

for yet another video, but we thought it worth mentioning these simplified differences here. Computer architecture is incredibly complicated, with many different facets and layers of complexity, and we have plans to make more videos that dive into each of these topics. But it’s important to note that each video we make takes close to a combined 1100 hours of researching, script writing, modeling, animating and editing. For example, we spent over 250 hours tearing down these non-working computers we bought from eBay and meticulously rebuilding each of the 3D models in Blender. So, if you could take a few seconds to like this video, subscribe if you haven’t already, share this video with someone who might be curious as to how CPUs work, and most importantly, write a quick comment below, it would help us out immensely. Just a few seconds of your time helps us far more than you think. So, thank you.

In the final section of this video we’ll discuss the diagram we showed earlier and the architecture of modern processors such as the M1. In contrast, the analogy we’ve laid out is rather simple, and you’re probably thinking that there must be more components in an actual CPU. In fact, this analogy is actually pretty close to what’s happening inside an Apple IIe computer. Specifically, the floppy drives are the bookshelves, and when we open up this computer we see the DRAM chips, which are the cart, and then going inside the 6502 processor we find an integrated circuit or die which has the corresponding sections that we’ll organize into an internal architectural diagram. In this diagram you can see the instruction decoder, ALU, the Program Counter, Current Instruction Register, the other registers and a few other sections. Specifically, here’s where the program counter is used to fetch an instruction, and here’s where the instructions and data from the DRAM chips are bussed in and out.
Finally, here’s where the instructions are decoded and the control signals are generated. One note is that there’s no cache in the 6502 because the DRAM chips of the 80s were just as fast as the processor, so the table in the analogy is even smaller. As we said at the beginning of this video, the 6502 chip is made from 4528 transistors, so let’s see what an M1 chip with 16 billion transistors would look like. To start, we have to significantly increase the size of the table. Next, we have to section off areas for each of the performance and energy-efficient cores, the GPU, and other areas.

When we focus on one of the performance cores, we see the complex diagram from earlier, so let’s discuss how this diagram compares to our analogy. Specifically, there’s a separate set of 64-kilobyte data and instruction caches. As mentioned earlier, there’s a pipeline that queues 8 instructions per clock cycle, and additional sections like a branch predictor to reduce issues with conditional branching and help the pipeline run smoothly. Here you can see the pipelined instruction decoder and 32 general-purpose registers. One key difference is that the calculator is broken up into 8 separate smaller calculators, each handling a few functions. Additionally, there’s a special section for load and store instructions. This is the layout of just a single core out of the 8, and there are entirely different architectures in the Graphics Processing Unit as well as inside the Neural Processing Unit.

One important note is that the inclusion of these 3 types of processors, along with hardware accelerators such as the media engine, makes this M1 chip closer to a system on a chip or SoC than a traditional CPU. Similarly, all the processors in these devices, including the CPU in your desktop computer, can be considered SoCs, and therefore the difference is more a marketing term than a technical one.
On a separate note, it’s important to mention that the M1, along with all modern processors, is a proprietary design, and therefore the diagrams we’ve shown are close approximations that we built using input from industry experts. Let’s finally discuss our analogy in terms of the GPU chips found in graphics cards. We have a separate video covering how graphics cards work, but with respect to this analogy, a GPU

### [35:00](https://www.youtube.com/watch?v=16zrEPOsIcI&t=2100s) Segment 8 (35:00 - 36:00)

CUDA core is actually very similar in complexity to the architecture of the 6502. Therefore with 10,000 to 20,000 CUDA cores in a single GPU chip, it’s like having a massive array of 6502 cores. The difference is that GPUs typically use 32-bit ALU calculators and perform single instruction multiple thread or SIMT calculations where a single instruction is fetched, decoded, and then distributed to a batch of cores, and then those cores execute that instruction using different addresses and data. However, there are many more nuances to SIMT and GPU architecture, so let’s wrap up this video on how CPUs work. We’re thankful to all our Patreon and YouTube Membership Sponsors for supporting our videos. If you want to financially support our work, you can find the links in the description below. This is Branch Education, and we create 3D animations that dive deeply into the technology that drives our modern world. Watch another Branch video by clicking one of these cards or click here to subscribe. Thanks for watching to the end!

---
*Source: https://ekstraktznaniy.ru/video/39623*