CS50 for Business - Lecture 2 - Designing Data Structures


Segment 1 (00:00 - 05:00)

Hello, world. My name is David Malan, and today is entirely about designing data structures. Now, last time, recall, we focused entirely on algorithms: searching for things, sorting things, and generally solving problems. But we sort of took for granted that we could store all of those things somehow in the computer's memory; we didn't really consider how we might do that. In fact, today we're going to do exactly that: take a look underneath the hood, so to speak, of a computer, consider how it actually stores data, and consider how you and I can more cleverly store data of interest to us, so as to facilitate those algorithms performing even better than they might if we used a simpler layout inside. Now, let me propose that inside of your computer, or really your electronic device of most any sort, is a chip that looks like this. This is a stick of memory, or RAM, random access memory, and this might store millions of bytes, even billions of bytes, aka megabytes or gigabytes respectively today. In fact, if we focus on just one of these black chips here, you can imagine zooming in on it, and let's take a look virtually inside. Inside of this chip are all of these bytes, and if there's a whole bunch of bytes, it stands to reason that we could sort of arbitrarily number them. Maybe this here at top left is the first byte, this is the 2nd byte, the 3rd byte, and if it's a gigabyte of memory, this might be the billionth byte, give or take, not drawn to scale. In other words, you can think of the bytes stored inside of your computer's memory as akin to postal addresses, if you will, simpler versions thereof. For instance, we might visit 45 Quincy Street, Cambridge, Massachusetts, 02138, USA, and that would lead us very specifically to Memorial Hall, the building in which Sanders Theatre at Harvard University lives. Now, if we were to send an envelope somehow or other, it would get to that very unique and specific address, and similarly digitally, if we wanted to store a byte at a very specific location in memory, we could send it to byte 1 or 2 or 1 billion. And to be fair, computers typically start counting at 0, as we've seen, so it might really be byte 0 and 1 and 2. But we don't have to use just individual bytes, because recall that 1 byte is 8 bits, and the highest you can count with 8 bits is 255, assuming positive numbers and zero alone. That's because with 8 bits you can represent 256 total possibilities, but if you start counting at 0, you can only go as high as 255. So we might want to use more than 1 byte. For instance, we might want to use 4 bytes or even 8 bytes to count even higher, to store bigger amounts of data. And that's what's nice about this model for memory. We can sort of treat this grid of memory as a canvas, if you will, kind of similar in spirit to an image with rows and columns of pixels. And that's not what this technically is, because indeed inside of this chip there's no notion of rows and columns; this is just an artist's rendition of all of the bytes therein. But it is nice in the sense that it's this canvas that we can use to just paint a picture of all of the data that we want to store inside of that computer. But at the end of the day, all we're doing is putting specific bytes at specific addresses. And so with that very simple paradigm in mind, we can begin to stitch together, if you will, the most interesting of data structures.
That is to say, a data structure is a way of structuring your data inside of the computer's memory. Now, let's go ahead and abstract away the physical hardware, focus just on this virtual artist's rendition thereof, and in fact, let's zoom in, because we're not going to need today all of those bytes therein. Let's consider much more simply, if this is some of our computer's memory, and maybe this again is byte 0, 1, 2, or something like that, how we might start storing some data therein. What kind of data? Well, it could be numbers, it could be letters, it could be words, it could be phone numbers. It could be any sort of data we want, but we'll keep it simple, as we've done in the past, initially here by storing, for instance, just numbers. Well, how might we go about storing numbers? I dare say the simplest way is to use what's called an array, the simplest of our data structures today, if you will. An array is a block of back-to-back memory in which you can store your data. Now, what do I mean by that? Well, here is a grid of just 3 bytes of memory, if you will, because my intent with this story is to store 3 values, at least initially. So for instance, if I want to store, for the sake of discussion, the numbers 1, 2, and 3, I can use 3 of those bytes inside of the computer's memory, using what we're going to call an array, to store them back to back.
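To make that concrete, here's a minimal sketch in C, the sort of language CS50 itself uses; the variable name numbers is just for illustration. It stores 3 values back to back and prints where each one lives:

```c
#include <stdio.h>

int main(void)
{
    // Three values stored back to back in one block of memory
    int numbers[3] = {1, 2, 3};

    // Because the values are contiguous, each one lives at a fixed,
    // predictable offset from the first
    for (int i = 0; i < 3; i++)
    {
        printf("%i lives at address %p\n", numbers[i], (void *) &numbers[i]);
    }
}
```

Run it, and the printed addresses should differ by exactly sizeof(int), typically 4 bytes, from one value to the next.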

Segment 2 (05:00 - 10:00)

This is in contrast to putting, like, the 1 over here, the 2 over there, the 3 over there. What's nice about an array is that by design the values are stored back to back, so that you know that the second value is 1 byte away from the first, and you know that the third value is 2 bytes away from the first. Though we don't have to limit ourselves to individual bytes; these could be increments of 4 or 8 or even larger than that. But for now we just have what we'll call an array of 3 values. Now, suppose that for whatever purpose we want to store a 4th value inside of the computer's memory. Maybe these numbers are not quite as trivial as 1, 2, 3. Maybe each of these represents a phone number or a name or some other value that you want to store inside of the computer's memory, and it stands to reason that eventually you might want to add a 4th one. Now, it's pretty obvious that, according to this story, if we're using an array, that 4th value should go right there. But the catch is that I think we've oversimplified the picture, because a moment ago, recall, these values certainly lived in the context of lots and lots of other bytes in the computer's memory, indeed millions of them, billions of them. So I think we might need to take a step back and consider that these three values do not live in a vacuum inside of the computer. There might be data all around them. And in fact, even though we might want to instinctively put the number 4 there, that might not be possible, because computers are of course doing many things at once nowadays. On your phone, on your laptop, on your desktop, you're probably running at any given time multiple programs; maybe one's in the foreground, the others are in the background, but you're still running multiple programs, and within a given program you're probably storing multiple values, not just a single number but 2 or 3. You might have strings of text. For instance, we began today with a "hello, world" exclamation, and here in fact might be, inside of a computer program, the words "hello, world", because maybe someone typed that into the keyboard, maybe that's the script for today's lecture. It stands to reason that inside of the computer's memory at any given time there might be numbers and letters and colors and videos and any number of other things, and we don't necessarily know that there's going to be available space where we want to put that 4th value. After all, if previously in the story I simply asked the computer for space for 3 numbers, it gave me space for those 3 numbers, and if in this same story I asked the computer, in whatever program I'm running, for room for a sequence of characters, that is, text, well, it knew I had already been given those 3 values here. Why not put H-E-L-L-O, comma, world right after that? That seems to be a very clean design. Unfortunately, it didn't anticipate my necessarily needing a 4th value. I mean, heck, I should have just asked for 4 values if I wanted 4 values. Now I've sort of painted myself into the proverbial corner here. Now, if the computer had been smart, so to speak, maybe it would have left room for a 4th value, but what if I wanted 5 values? It might not have anticipated that. What if I wanted 6 values, 7? So at some point the computer's probably going to do the simplest thing, which is just plop the next values right where there's space available. So it's certainly reasonable that this scenario might happen. Now, why is Oscar the Grouch all over the screen?
Well, for today's purposes, we're going to stipulate that any time I don't know or care what the value is at some location in memory, we're just going to represent it as a so-called garbage value. Maybe it's remnants of the program having used that memory before, or the memory not having been cleared or reset to all zeros or something like that. So for this story, I don't really care what those values are where Oscar is depicted. I just know that they're so-called garbage values. But I'm proposing that at this point in the story, the program does care about the numbers 1, 2, 3, and the phrase "hello, world", because those are clearly visible in memory. Long story short, where can I go and put this 4th value? Well, the value of these Oscar the Grouches, or garbage values, is that technically I can use any of them, because if they're just garbage, I can repurpose the zeros and ones that are there, change the patterns that they're representing, and store, if I want, the number 1, the number 2, the number 3, and best yet, the number 4. So I've clearly got a lot of available spots into which I could put 1, 2, 3, 4, but where to do that? Well, if here are 4 arbitrary Oscar the Grouches in a row, which means they lend themselves to repurposing these 4 bytes as an array, what I could do is copy the 1 to that first location, the 2 to the second, the 3 to the third, and now I can go ahead and put the 4th value, which was the whole point of this story, into that 4th location. Now I don't need that original array anymore, so I can go ahead and just give that memory back to the operating system, so to speak, and allow it to be reused eventually for other purposes entirely. So that, then, is an array: a block of contiguous memory, that is, values back to back.
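In C, that relocation might look like the following sketch: ask for a bigger block elsewhere, copy the old values over, add the 4th, and give the old block back. (In practice, C's realloc function does much of this for you; the explicit version below just mirrors the story.)

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Originally, we asked the operating system for space for just 3 values
    int *list = malloc(3 * sizeof(int));
    if (list == NULL)
    {
        return 1;
    }
    list[0] = 1;
    list[1] = 2;
    list[2] = 3;

    // To add a 4th value, find a bigger block of (garbage) memory elsewhere...
    int *tmp = malloc(4 * sizeof(int));
    if (tmp == NULL)
    {
        free(list);
        return 1;
    }

    // ...copy the existing values over, one step per value...
    for (int i = 0; i < 3; i++)
    {
        tmp[i] = list[i];
    }

    // ...add the 4th value, and give the old memory back
    tmp[3] = 4;
    free(list);
    list = tmp;

    for (int i = 0; i < 4; i++)
    {
        printf("%i\n", list[i]);
    }
    free(list);
}
```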

Segment 3 (10:00 - 15:00)

But that's a key defining characteristic: no matter how many values you want to store in an array, you must keep them together, according to that definition, which means if you want to grow the array and there is not space available for that additional value, you're going to have to relocate the array altogether. Now, I arguably got lucky, because there were plenty of garbage values around me, so I could repurpose most any of those in order to store the 4 values, but that's not necessarily always going to be the case. And problematically, too, even though I told that story fairly quickly, just copy the 1, then the 2, then the 3, and then add the 4, well, that took work. It took like 3 steps plus the 4th step to add the 4. That was a good amount of work just to add one value; I had to copy the entire array. So could I maybe do better than that? Well, it turns out you can, using this very same canvas of computer memory, if we think a little harder about how to treat all of those individual bytes. And in fact, the second data structure we'll talk about now is a so-called linked list, which is an alternative to an array. But as is so commonly the case in computer science and technology more generally, there are going to be trade-offs. We're going to have some advantages here, but potentially some disadvantages as well. So let's rewind to that canvas of memory, which is just a whole bunch of bytes that we can treat however we want, and let me propose again that I'd like to store the value 1 initially. But for now I'm also going to reveal that indeed each of these locations in memory has an address, 1, 2, 3, or 0, 1, 2, however you want to label them, 45 Quincy Street in the real world. But let me propose, using some standard terminology, that this value 1 is at location, oh, I don't know, 0x123. Now, the 0x is a little weird, but that just indicates that I'm using what's called hexadecimal notation, which is base 16, which is different from base 10, aka decimal, and different from base 2, aka binary. And it's not necessary, but this is conventional: when talking about computers' memory, it turns out that computer scientists and programmers tend to think about addresses and look at addresses in hexadecimal notation. But I still chose a simple value. This is at location 0x123; this is not the value 123 per se. This is location 0x123, at which the number 1 currently lives. So suppose I now want to store the number 2 in this list of numbers. Well, I could just put it here and put the next number there, but again, per our previous story, let's suppose that there's something in the way. The memory immediately to the right of that number 1 is maybe already in use by something else. But that's OK, because if I don't hold myself to the constraint of an array, that values must be back to back, I could kind of just use this really as a broad canvas and put the number 2 wherever I want. For instance, if the first available space in the computer's memory is way over here, which is not contiguous with the number 1, it's not back to back with it, that's OK, because I can just observe that that second value happens to be, for the sake of discussion, at location 0x456. Again, I made it up so it's an easy-to-pronounce number, but 0x456 might be where the 2 now lives in memory. However, unlike with an array, it is not the case that the second value is 1 byte away from the first value. In fact, according to this picture, it would seem to be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 bytes away, which is kind of arbitrary and not necessarily predictable.
It just so happened that we had room for the 2 right there. In fact, let's continue the tale. Suppose that the next available spot, for the number 3, is over here at location 0x789, and again, the picture is not to scale, because there are more than 10 bytes between these addresses. 0x789 is where we might put the 3, but as with the 2, it's not necessarily obvious where the 3 is relative to the 2 or relative to the 1, if that just happens to be the first available spot. So what can I do now? Well, if this is just a canvas of memory in which I can store bytes, and bytes in turn can be thought of as numbers, just patterns of bits that represent decimal numbers or whatever, maybe I could use a few more of these bytes to kind of stitch together these values in memory, or, if you prefer, metaphorically, to leave some bread crumbs that lead from one value to the next. Or, I kind of like this mental model: if you remember, in olden times you might stitch popcorn together using a thread around a Christmas tree at the holidays. Well, this too might be a way, metaphorically, whereby we could kind of stitch these numbers together, by using their locations. So let me propose this. Let me get rid of some of the distracting bytes that we're just not using in this story, but let me propose that for each of these three values

Segment 4 (15:00 - 20:00)

we use one additional byte for what we're going to call metadata, just to keep track of where the next value is. So again, nothing has changed yet. The 1 is at location 0x123, the 2 is at location 0x456, and the 3 at 0x789. What if I, if you prefer yet another metaphor, create a sort of map for myself, a treasure map that leads from one value to the next? Well, if I instead treat each of these numbers as taking up not one byte but two, let me use the second byte as a pointer, if you will, to the next location in memory. So the first byte will contain the actual number I care about, the data, like 1, but the second byte now will contain the address of the next value I care about, which is going to be 0x456. And if you think about this now as a sort of treasure map, OK, 0x456 means the next value is at that location here. Well, how do I get from this second location to the third? Let me use the second byte of the second value to store the number 0x789. And 0x789, if I follow that pointer or address, if you will, is going to lead me to the third value, 3, which is again at location 0x789. And I'm deliberately using a term of art here, pointer, which is indeed how programmers would refer to these values, insofar as they're sort of pointing from one location to another, as depicted here by my hand. Now, as for the second byte here, technically I don't really need a second byte, because there's no 4th value yet, but to keep this data structure standardized, such that each of these chunks of memory is functionally the same, I should at least allocate that second byte even for this number 3, even if I don't plan to use it. Now, even though I'm not going to use it, there's still 1 byte, or 8 bits, here. What should those 8 bits be? Well, by convention we're just going to make them all zeros, which we'll represent with the value 0x0, which just means 00000000, a sort of sentinel value, such that if the computer ever sees a zero byte, otherwise known as a null byte, it's sort of a terminus in that path. That's it; there are no more values in this here list. So this is all a bit cryptic to look at, and frankly, no programmer, no computer scientist is really going to care too much about what the specific addresses are. After all, the theme in computing is generally that of abstraction, whereby we shouldn't have to care forever about these low-level details, like the specific addresses at which these values live. So it's conventional to abstract those specific addresses away and simply indicate, on, say, the whiteboard on which you're designing this data structure, that this so-called linked list does have an address, that is, the address of the very first byte, which I'll claim is again 0x123. But at the end of the day, all we really care about is that we've constructed a list of three values, irrespective of the specific and uninteresting addresses at which those three values live. So we're going to draw it indeed with arrows pointing from one value to another, in order to indicate visually, pictorially, that this is indeed a linked list of three values, each of which we're going to conventionally call a node. So this node contains 2 bytes: 1 piece of data that I care about, and, if you will, 1 piece of metadata, which is information that helps you maintain the data you care about.
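In C, such a node might be declared like the following sketch, along with a tiny program that follows the pointers from the first node until it hits null. On real hardware a pointer is typically 8 bytes rather than 1, but the idea is the same:

```c
#include <stdio.h>

typedef struct node
{
    int data;           // the value we care about
    struct node *next;  // metadata: the address of the next node, or NULL
}
node;

int main(void)
{
    // Three nodes that need not be adjacent to one another in memory
    node third = {3, NULL};    // next is NULL (0x0): the terminus
    node second = {2, &third};
    node first = {1, &second};

    // All we remember is the address of the very first node
    node *list = &first;

    // Follow the treasure map: each node points to the next
    for (node *ptr = list; ptr != NULL; ptr = ptr->next)
    {
        printf("%i\n", ptr->data);
    }
}
```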
In this case, that metadata is the address of the next node in memory, but we're going to make sure we remember where all of these values in memory are by storing one special pointer, that is, the address of the very beginning of the list. Now, it turns out that suffices. I don't have to remember forever the address of the 2nd node or the 3rd node, because I can always figure that out from these nodes in memory. That is to say, if you give me the address of the very first byte in memory, and let me rewind to before we had the arrows, 0x123 will lead me to this node, 0x456 to this node, 0x789 will lead me to this node, and 0x0, aka null, will indicate that this is now the end of the list. So what's the upside here? Well, among the advantages of these linked lists is that we're no longer bound by the simple but nonetheless real constraints of arrays, whereby your values must be, by definition, back to back in memory. And that was limiting insofar as there might not be available space right next to the array in memory, which, if nothing else, is going to create a whole bunch of work for me, because as with our simple example earlier, I had to copy the 1, the 2, and the 3 to a whole new location in memory just to add that fourth value. Now, in the world of linked lists, I dare say if you wanted to insert the number 4, well, you could plop it most anywhere.

Segment 5 (20:00 - 25:00)

Over here, over there; it doesn't really matter, because you could sort of pictorially draw another arrow from the 3 to that new value, or, more specifically, you could change that null byte, 0x0, to be simply the address of that 4th additional value, wherever it is in memory. So linked lists offer this sort of dynamism, where you can grow them, and heck, you could even shrink them, without having to impact all of the existing values, as by copying them from one place to another. In other words, it would seem to just give us the ability to grow and shrink this data structure without wasting time ever really copying values around. But there might very well be a downside, and in fact that again is thematic in computing, such that any time you sort of get a win, you're going to have to take a loss somewhere else. There's going to be a trade-off that often involves time or space, and indeed we saw that with our discussion already of algorithms. So let's consider, for instance, what the efficiency, or the cost, is of inserting a new value into a linked list as designed here. And let's consider our little cheat sheet of common running times for algorithms. This is not exhaustive, but it's representative of some very common running times, as we've seen: big O of n², n, log n, and 1. And the last of those represented constant time, which is ideally the fastest, whereas big O of n² could take us quite a few steps. Well, here is the beginning of a linked list, and it's initially empty, because it's not pointing at, or storing the address of, any actual node or value. So we're beginning at the beginning. How now might we insert the first value? Well, if the first value is 1, maybe there's some available space over there, and so we can draw an arrow from this variable, if you will, this placeholder for the beginning of the list, that leads to that location in memory. Now, I don't know where it is; it's like 0x-something, so I would technically store that address in this box, but that just means pictorially that it's pointing to this location in memory. All right, what if I insert another value into this linked list? How might I do it? Well, I dare say the simplest way to insert a new value into this list is to just kind of squeeze it in at the beginning, and so I could put 2 there. I could then put a third value right over there, and notice the 1 has not moved, the 2 has not moved. I've just added the 3 wherever I can. And even though it might take a few steps, and if we were to type this out in pseudocode or actual code, it would indeed take a few steps to maintain these arrows and these addresses, the advantage here is that the 1's not moving, the 2's not moving, and heck, if I add a 4th value, the 3 is not going to move either. But notice what has happened. If I care about maintaining these elements in sorted order, say in increasing order, 1, 2, 3, I've accidentally built this linked list in reverse order. Now, obviously, our human eyes can easily see that, well, you could just traverse the list in the opposite order. But that's not good enough, because how do you get to this last element of the list? Well, you have to start at the beginning and follow the arrows, and that's going to take a bunch of steps just to get there. And these arrows, as we've described them, only point in one direction, so to speak. That is to say, we're only maintaining the address of the next node, not the previous node.
I'm going to have to go back and forth through this list again and again to find the first element I inserted, then the next element, then the next element. That's a lot, a lot of steps. However, it's pretty darn efficient to do the insertion itself. Not the subsequent searching or traversing of the list, but the insertion is pretty darn efficient. Why? Because all I have to do to insert a new node is find some space for it, point it at the existing list, and then point this box at the new node, and that's going to take, what, 3 steps, give or take, total. But that's constant time, which is generally good, because it doesn't matter how many nodes are already in the list; I can, with just 3 steps, put another one in, and put another one in. The downside, of course, is that they're ending up in some random, and in this case reverse, order, because I'm not inserting them deliberately in sorted order. So among our little cheat sheet here, what might the running time of this implementation of inserting be? Well, it's going to be constant time, and that doesn't mean literally one step, that would be super fast, but it does mean, representatively, a constant number of steps, and I'm spitballing that it might take some 3 steps in total.
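A sketch of that constant-time prepend in C, with the node type from the earlier sketch redeclared here for completeness; whatever the list's length, it's the same three steps:

```c
#include <stdlib.h>

typedef struct node
{
    int data;
    struct node *next;
}
node;

// Insert a value at the front of the list in constant time.
// Returns the new first node (or NULL if out of memory).
node *prepend(node *list, int value)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL;
    }
    n->data = value;  // step 1: store the value in the new node
    n->next = list;   // step 2: point the new node at the existing list
    return n;         // step 3: the caller points its "box" at the new node
}
```

Prepend 1, then 2, then 3 this way, and the list reads 3, 2, 1: exactly the reverse order just described.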

Segment 6 (25:00 - 30:00)

All right, so that's not necessarily the best way to do this, because indeed we have this side effect that we're accidentally putting things in reverse order; even though we're putting them in pretty darn fast, we're sort of making a mess of things. How else could we do this? Well, I could instead start with an empty list, and I could begin by adding the number 1. Suppose I then want to insert the number 2, and suppose, to keep it simple still, but different from the last algorithm, I insert the next node at the end of the list. That might by chance give me sorted order, 1, 2, 3. But had I inserted different numbers in a different order, I might be appending them, so to speak, to this linked list such that they still end up in sort of a messy order. So let me propose that we now consider the running time of that algorithm, always inserting at the end of the list; it's actually going to be worse than the first approach, because to find the end of the list I have to traverse the list, traverse the list, every time I insert a new node. Now, to be fair, if your mind is going here, I could use some additional memory to just keep around the address of the most recently added node, effectively keeping a pointer to the end of the list, to obviate the need for going through the list again, and that's totally fine, but it's a slightly different algorithm. So I'm going to keep it simple for now, just to compare and contrast these simplistic implementations of inserting, where we insert at the beginning, otherwise known as prepending, or we insert at the end, known as appending. And the first running time, we claimed, was big O of 1, constant time. The second algorithm, appending, is going to be linear time, big O of n. If you've got n elements in the list, it's going to take you that many steps to find the end, assuming you don't use some additional memory to remember it in perpetuity. Well, let's take one other approach, whereby we consciously insert elements in sorted order. And now, depending on the element I want to insert, it might end up at the beginning of the list, the middle of the list, or the end of the list, based on the value in question. So suppose I insert the number 2 first, but I want to maintain sorted order, and then I want to insert, say, the number 1. Well, it doesn't matter where the 1 goes in memory, but it does matter that the arrows point in the appropriate direction. So here I sort of got lucky, in that 1 is smaller than 2, so it really belongs at the beginning of the list, and we saw earlier that prepending to a list is pretty fast: constant time, 1, 2, 3 steps to update all of the arrows in question. Well, I got lucky here, but what if I next want to insert, say, the number 4? Well, that goes, in terms of its value, all the way at the end. So that's going to be big O of n steps now, to find the end of the list. And then if I insert the 3, that's going to get a little trickier, because I have to now find the right location for it, which again means traversing the list, looking for the spot between a value that's less than the 3 and one that's greater than the 3, and then putting it there. And there are other scenarios, too, because if it's the largest element, it's going to end up at the end; if it's the smallest element, it's going to end up at the beginning. But we already saw those scenarios. So in this case, the 3 ends up, if you will, in the middle of the list. So inserting here, too, would seem to take linear time. And so here's already a disadvantage of linked lists, at least as we've described the insertion algorithms, whereby it's taking linear time, in steps, to add new values to the list. Whereas in the array scenario, it turns out it's conventional that you just kind of plop a new value at the next location, because by nature of arrays you do keep track of the size of an array, and therefore where the end of the array is, and so you have really what's called random access to arrays. That's actually the origin of the term RAM: random access memory means that you can jump right to the location where you want to put the value.
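Before moving on, here's a sketch of that sorted-order insertion into a linked list, assuming the node type and include from the prepend sketch above; in the worst case it traverses the whole list to find the right spot, hence linear time:

```c
// Insert a value while keeping the list in increasing order:
// big O of n, since we may traverse the whole list to find the spot.
node *insert_sorted(node *list, int value)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL;
    }
    n->data = value;

    // If the list is empty or the value belongs first, just prepend
    if (list == NULL || value < list->data)
    {
        n->next = list;
        return n;
    }

    // Otherwise, walk until the next node's value is no smaller
    node *ptr = list;
    while (ptr->next != NULL && ptr->next->data < value)
    {
        ptr = ptr->next;
    }
    n->next = ptr->next;
    ptr->next = n;
    return list;
}
```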
With linked lists, we've created a bit more work for ourselves, because while we have the upside of being able to use memory wherever it exists, we then need to stitch everything together and find all of those locations, potentially again and again. And deletion, in fact, is no better. For instance, if here is a list of 4 values, and we want to delete any one of those values specifically, we're going to have to find it first. So really, deleting is sort of equivalent to searching for a value, because you have to search for it before you can delete it. And once you've deleted it, you have to then update these arrows, which is a bunch of steps too. So that, too, I would claim, is going to be linear time as well. Meanwhile, searching, of course, gives us the same problem, because even if you don't plan to delete anything, you've got to find a value, by definition of searching for it. So can we do better than this? It would seem that we've gained dynamism by introducing linked lists, because we can grow and even shrink them much more easily than arrays, insofar as we don't have to copy the original array to the new array location, which itself might be expensive, but we're still wasting time in other ways, by having to traverse the list. So it's almost kind of a wash. Moreover, with a linked list, let's go back to the diagram: if you wanted to search for some value, you can't use so-called binary search, whereby you go to the middle, and then the middle of the middle, and so forth. Why not? Well, even though you and I have this sort of bird's-eye view of this picture and can kind of ballpark that the middle is, like, roughly over there, you don't know how to get to the middle unless you start at the beginning of the list, count all of the elements therein, then do it again and go to the halfway point, and

Segment 7 (30:00 - 35:00)

then the halfway point of the halfway point, and so forth. Now, again, you could use some additional memory to sort of remember that in perpetuity, but that's a lot of memory if it's a really big list. So it doesn't feel like we've made progress when it comes to the efficiency of storing values in memory. We've just given ourselves some flexibility when it comes to finding space for more and more values. So can we search more effectively and gain back the speed of binary search, whereby I was able to divide and conquer the problem, in half and half and half? Well, I think we can, if we take yet another approach to our computer's memory and treat that canvas as though we can build trees on it. Now, what do I mean by this? Well, let me propose that we explore yet another data structure, namely a binary search tree, which is kind of like a family tree, but instead of having members of your family drawn in the computer's memory, you just have the values we're talking about, in this case, but in a particular order. So for instance, let me propose that we're storing initially 7 values, and for simplicity, let's propose that these 7 values are currently in an array. It would look quite like this, and it's an array by definition because all of the values are back to back. Recall that when you have these values back to back, you can very easily jump to the middle element, because if you know the index or address of the first element, and you know the length of the array, or the index or address of the last element, you can literally subtract one from the other, divide by 2, round appropriately, and boom, you can find the middle element, and similarly the middle of the middle, just by doing a bit of arithmetic as well. This is to say, when we talk about random access, where you can jump right to a location, it doesn't mean random in the sense that you go to some location you don't intend to. It means that you can jump in constant time to any specific address in memory, thus RAM, random access memory, and arrays really thrive in that environment. You don't have to start at the beginning to find the end. You can do a little bit of arithmetic to find the end, or even the middle, because you know that every value is adjacent to the next. So can we keep that efficiency but still give ourselves the dynamism of linked lists, where we can grow and shrink them by just using any available memory, without having to constantly move things around? Well, let me propose that we think about this array maybe not in one dimension, but in two. In fact, let me color code the middle element first, then the middles of the middles, and then the remaining elements here, if you will. And let me propose that, at least in your mind's eye, we redraw these 7 values in two dimensions, adding some verticality. So the numbers have only moved up or down; they've maintained the same order from left to right. And let me propose that now each of these numbers is just yet another node, a chunk of memory that you're using to store some data and maybe metadata, but this time, for metadata, let's not just store the address of the next value. Let's use 2 pieces of metadata, 2 bytes if you will, 2 pointers, arrows if you will, to keep track, for every element, of which element is smaller than it and which element is larger than it. And that's the key definition of a so-called binary search tree: for any node you look at, its left child, so to speak, borrowing the family tree metaphor, is going to be less than it.
And its right child is going to be greater than it, and that's true of the subtrees too. For instance, relative to the number 4, not only is the left child smaller, but the whole left subtree is smaller. Every element to the left of the 4 will be smaller, even the 3. And every number to the right of the 4 is going to be larger than the 4, including the 5. So that's the definition of a binary search tree: you maintain that particular order. Now, even though I haven't drawn it to scale, you can think of each of these nodes as taking up 3 bytes: one for the data you care about, the actual number like 4; one for the address of, or pointer to, the left child; and one for the address of, or pointer to, the right child. So each of these boxes and its two arrows collectively represent 3 bytes. And again, this is what I mean by using your computer's memory as a canvas. At the end of the day, all we have are 0s and 1s, and in turn bytes, units of 8, so I can use those really however I want, and I am choosing now to use those in groups of 3 bytes in order to create, effectively, this diagram. Now, this diagram is not going to be laid out literally in this way, top to bottom, left to right, inside of the computer's memory. These nodes could be anywhere.
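In C, one might declare such a node like this sketch (the name tnode is just to distinguish it from the linked list's node):

```c
// A binary search tree node: one value plus two pointers of metadata
typedef struct tnode
{
    int data;             // the value we care about, like 4
    struct tnode *left;   // everything down this subtree is smaller
    struct tnode *right;  // everything down this subtree is larger
}
tnode;
```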

Segment 8 (35:00 - 40:00)

But that's what the arrows are for: they're going to point at the locations in question. Now, what's the value of designing this tree, this binary search tree, in this way? I think I have magically gained back the ability to do binary search. Why? Well, so long as this binary search tree is ultimately defined by its root, the topmost element, and that is how you access all of the other elements, watch how many steps it takes me to find any number. Suppose I'm searching for the number 1. Well, obviously, with our bird's-eye view, you can see it, but the computer can't. If the computer starts searching for the number 1 at the root of this binary search tree, it's going to see the number 4, but it can use a conditional, recall, in turn a Boolean expression, to ask itself: is the number I'm looking for less than, greater than, or, heck, equal to the number I'm looking at? Well, 1 is obviously less than 4, so the computer can search down this way and ask the same question again, recursively, if you will: is the number I'm looking for less than, greater than, or equal to this value? Of course, the number 1 is less than it, and so we go down this way. Notice we have not looked at this element or this element. In fact, we've sort of divided and conquered this problem by chopping the tree, if you will, in half and half. That is to say, when I go left here, it's kind of like destructively snipping the tree there and not caring about anything in that direction. Similarly, when I go this way, it's sort of like snipping the tree here and not caring about anything in that direction. And we've sort of seen this metaphor before, and it too was fairly destructive: when I tore the phone book in half, and in half again, I didn't care about the half of the book to the left or right, again and again. Here, too, it's sort of the same thing. Of course, it's not destructive; these elements in memory aren't going anywhere. But we are saving time by not looking at half of the tree, then half of the subtree, then half of that subtree, and so forth. So what do we gain from this? Well, it would seem that the running time of this algorithm is now logarithmic, technically log base 2 of n, because I'm effectively dividing the problem again and again in half by traversing it in that way. Now, what's the trade-off of having done this? Because that seems like a win: I now have the ability again to do binary search, which was one of the fastest algorithms we've thus far seen. I no longer suffer from the slowness of having to follow all of those arrows just to find the middle or the end elements and so forth. So that seems to be a win. But what's the downside? Well, it would seem that there is a trade-off: I'm saving time, but I'm losing space, if you will. Indeed, this tree takes up even more memory than the data structures we've seen thus far. Arrays were super efficient: all we stored was the data, the values 1, 2, 3, and even 4, that we cared about, back to back. There was no metadata. There were no pointers; there were no additional bytes just to maintain that data structure. Linked lists, though, like, literally overnight doubled the amount of space we needed, because we needed as much space for the data, the values we cared about, 1, 2, 3, 4, but we also needed metadata bytes to store 0x123, 0x456, 0x789, so that was twice as much space. And somehow or other, I've now let myself spend 3 times as much space with this data structure. Why? Because for every value, like 4, we now have two pieces of metadata: the left child and the right child, or technically the addresses thereof, or pointers thereto.
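That divide-and-conquer search translates almost verbatim into a recursive function; a sketch, using the tnode type from above:

```c
#include <stdbool.h>

// Search a binary search tree in big O of log n (if it's balanced):
// each comparison snips away half of the remaining tree
bool search(tnode *tree, int value)
{
    if (tree == NULL)
    {
        return false;                       // fell off the tree: not there
    }
    else if (value < tree->data)
    {
        return search(tree->left, value);   // ignore the right subtree
    }
    else if (value > tree->data)
    {
        return search(tree->right, value);  // ignore the left subtree
    }
    else
    {
        return true;                        // found it
    }
}
```

And note that the left and right pointers this function follows are precisely the extra bytes of metadata this tree spends.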
Now, technically, there's a little bit of savings at the bottom here, insofar as the leaf nodes, the grandchildren, if you will, don't actually have any children or grandchildren themselves. But per my comments earlier, we really want every node to be structured in the same way, even if we're not using those two child pointers, left and right. So technically, even these nodes are taking up 3 bytes in this story; it's just that 2 of those 3 bytes are null, 0x0, which indicates again that there really is no arrow there. It's not pointing at anything, at least as of now. Maybe it will be, because if we insert, like, the number 0 into this, it might actually belong as the left child of this here number 1. So it takes up more space to use this binary search tree than it did a linked list, and than it did an array. What other downsides might there be? Well, I've shown you this binary tree already built, but how did it get built up in precisely this way? Let's consider building this tree step by step and see if there's a bit of danger there too. For instance, if we clear the board and begin again by inserting just a single number, like 2 this time, that might sit at the root of the tree. It has no children,

Segment 9 (40:00 - 45:00)

so it's a fairly simple binary search tree. It doesn't look like much of a tree, but it still meets the definition, because it's greater than any of its left children, which is not applicable, and it's less than any of its right children, which is also not applicable, so it doesn't violate any of the terms. Suppose we then insert the number 1. Well, ideally it would go over here, because that's less than the 2, and I think we're still good. It still doesn't look like a tree, but it also doesn't violate the definition thereof. If I then insert the number 3, it's going to go, numerically, over here as the right child, because it's greater than the number 2. So it would seem fairly straightforward when building a binary search tree that it all sort of works out perfectly. But that's not necessarily going to be the case. I chose this order, 2, 1, 3, deliberately, to show that sometimes we can get lucky and lay out the memory exactly as we want the first time around. But consider a more perverse case, if you will, a worst-case scenario, in which the insertion order happens to be: insert the number 1, so far so good, then the number 2, but then the number 3. I don't like where this is going now. Technically, this is a binary search tree, because 1 is greater than anything to its left, even though that's not applicable, and 1 is less than everything to its right; the 2 is to the right of it, and greater than nothing over here. So it technically meets the definition of a binary search tree, but as you can kind of see, especially if we continue to insert 4 and 5, ironically, in this case, inserting numbers in increasing order is actually a bad thing, because this binary search tree is going to devolve, if you will, into what kind of looks like a linked list. You know, it's not quite drawn in a way that fits on the screen left to right, but that was just an artist's rendition thereof. It's going to just get long and stringy, which is really just going to be a linked list from smallest to largest, even though I might be drawing it in this way. Now, that's fine; like, that's still a connected data structure, you can still search for values, but you can't search for them very efficiently, because it's so long and stringy. But there's a correction here, right? Like, even with 3 values, I could make this look more like a tree if I balance it, by kind of pivoting everything like this. So the 2 should eventually become the new root, 1 should become its left child, the 3 should remain its right child, and then I think I'm off to a better start. In other words, if my algorithm for inserting nodes is a little more intelligent, then as soon as I notice that the thing is devolving into, like, a long and stringy linked list, fine, do some kind of pivot to correct that, so that the problem doesn't get too bad too quickly. And in fact, there are other data structures, which we won't talk about in this particular class, that do exactly that. They have additional steps, additional code if you will, that tell the computer how to maintain a balanced binary search tree, which this, at the moment, is not. Now, to do so is going to take a few more steps, but it turns out that's not terribly many steps, and we're still going to be able to retain, ultimately, the logarithmic running time of searching and deleting and inserting in a binary search tree, even if we maintain balance. It just so happens that the algorithm for doing so is a little more involved. This, of course, is just a pretty simple picture, to pivot this over to the left.
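Here's a sketch of that naive insertion in C, with no balancing at all, again using the tnode type from above. Feed it 1, 2, 3, 4, 5 in that order, and every new value hangs off a right child: the long, stringy, linked-list-like shape just described. (Self-balancing data structures such as AVL trees or red-black trees add the pivots, formally rotations, on top of logic like this.)

```c
#include <stdlib.h>

// Naive insertion: walk down the tree and hang the new value where it
// fits. Without rebalancing, sorted input devolves the tree into a "list".
tnode *insert(tnode *tree, int value)
{
    if (tree == NULL)
    {
        // Found the spot: allocate a new leaf with no children
        tnode *n = malloc(sizeof(tnode));
        if (n != NULL)
        {
            n->data = value;
            n->left = NULL;
            n->right = NULL;
        }
        return n;
    }
    if (value < tree->data)
    {
        tree->left = insert(tree->left, value);
    }
    else if (value > tree->data)
    {
        tree->right = insert(tree->right, value);
    }
    return tree;  // duplicates are simply ignored
}
```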
So whereas a binary search tree could devolve into linear time, big O of n, not so if we are smart about it and we maintain balance in that tree, as by changing what the root is as soon as we see that the situation is devolving, so as to maintain more of an actual tree-like structure. So, funny enough, it seems as though this is how we could be storing our contacts in our phones, be it iOS or Android. We could be storing them in a binary search tree, better yet, a balanced binary search tree, the nodes of which ultimately store things like John Harvard's phone number and maybe email address and other information. It doesn't have to just be simple numbers like 1, 2, 3, 4. We can actually store, of course, strings of text and more information as well. And we can therefore store all of our contacts in a way that allows me very efficiently to insert someone else, or to delete someone if I don't want them in my phone anymore, without having to use a simplistic array, which, recall, might result in iOS or Android having to constantly copy all of your contacts from one big chunk of memory to another just to add one more person. And when you use the autocomplete feature inside of your contacts app, you ideally want to be able to find someone quite quickly, as via binary search, and not something slower like linear search. And so with binary search trees alone, I think we're pretty well on our way to getting the best of both worlds: dynamism, that is, dynamic insertion and deletion, that doesn't necessarily require linear time. In fact, it looks like we could achieve such in logarithmic time,

Segment 10 (45:00 - 50:00)

and searching can be done in logarithmic time, that is to say, big O of log n. So we're back in business, even though there's still a trade-off. What was that trade-off? Well, any time we're saving time, we're probably spending something else, and indeed, even though we're saving time, we're using more space. And maybe that's fine. Thankfully, nowadays, space is getting less and less expensive. The memory you can have inside of your phone, your laptop, your desktop is getting less and less expensive, so maybe that's a reasonable trade-off. Then again, there are other trade-offs, and this doesn't really apply to us as the users of these devices, but the programmers at Apple and Google and beyond probably had to write more complicated code, more sophisticated code, fancier algorithms, to implement the binary search tree than a simple array, and even maybe a simple linked list. So human time, developer time, development cost might be yet another resource that we want to factor in when it comes to making decisions, for instance, at Apple and Google: which algorithm should we use? It might be faster for the user, but if it's going to take the programmers lots of time to do it, and therefore cost, maybe it's not the right call. Then again, if you're fortunate to have millions of users, having one team member or a small team work on a better feature for longer means you can amortize that cost, of course, over the many, many uses of the algorithm itself. All right. Seems like we're in a pretty good place. Are we done? Well, not necessarily. Logarithmic time is pretty darn good, but the best thing we've seen on our little cheat sheet is that so-called constant time, big O of 1. Can we get closer to constant time? Because that's kind of the holy grail of an algorithm: no matter how much data is inside the computer's memory, you can still find something in, like, one step, or maybe 2 steps or 3 steps, but some number of steps that has no dependence on the number of values already in the data structure. So you know what, let's get a little crazy here, and let's see if we can maybe borrow some ideas from arrays and linked lists and kind of mash them together, kind of like we did with binary search trees, but let's build something even fancier inside of the computer's memory, namely a hash table. So a hash table is a data structure that, if we do this well, is going to get us closer to constant time. Well, what could we do? Well, let me propose that I'm going to start with an array. I'm going to draw it vertically just so it fits nicely on the screen, but that's just an artist's rendition thereof; it's equivalent to an array that goes left to right, because again, that memory chip has no notion of up, down, left, right. This is just how we are addressing things. But notice, if you count these up, I've deliberately chosen an array of size 26. Why? Well, for this story, let's move away from storing simple numbers like 1, 2, 3, 4, and let's actually store people, like John Harvard, that is, contacts in your address book, which might include a name, and in turn a number or an email address or whatever other pieces of data you keep in your phone. So why 26? Well, maybe I could store everyone whose name starts with A up top, everyone whose name starts with Z down here, and all of the letters in the English alphabet in between, and you can do this for any human language. Now, that's all fine and good, at least initially.
For instance, if I draw some inspiration from, say, the world of Nintendo, we can store the names of Nintendo characters in the computer's memory as follows. Each of these locations can, of course, have an address or index associated with it, from location 0 down to index 25. What happened to 26? Well, again, any time you start counting at 0, you only go as high as n minus 1, and if n is 26, the lowest element on the board here is going to be 25. All right, now you can conceptually think of each of these elements in the array as representing A through Z accordingly. So in this case, A is not in fact 65, B 66, because I'm not using ASCII, or more generally Unicode, here. I'm really just treating the first box, the zeroth box, as a placeholder for the A names, and the 26th box, location 25, as a placeholder for the Z names. All right, let's go ahead now and start inserting some names. It's-a Mario, here in the middle. Well, that's going to go at the M location, roughly here. What's that? 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. M is the 13th letter, but because we start counting at 0, it's at location 12. So all is good now. What's nice about this data structure? Well, if I subsequently start searching for Mario in my phone and type in M, the phone knows to immediately go to location 12. Why? Because if it looks at just the first letter of Mario, it's pretty easy to figure out that the letter M corresponds to location 12, because you could write code to do that, and you can jump randomly, but specifically, to that location. You don't have to start at the top looking for Mario.
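That little bit of arithmetic, mapping a name's first letter to a location from 0 to 25, is a hash function. A minimal sketch in C:

```c
#include <ctype.h>

// Hash a name to a bucket from 0 to 25 based on its first letter.
// 'A' is 65 in ASCII/Unicode, so subtracting 'A' maps A..Z to 0..25.
unsigned int hash(const char *name)
{
    return toupper((unsigned char) name[0]) - 'A';
}
```

So hash("Mario") returns 12, hash("Luigi") 11, and hash("Peach") 15, each in constant time.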

Segment 11 (50:00 - 55:00)

You know that Mario, first name starting with M, is going to be at location 12, no matter who is above or below him. Let's insert another friend into the phone now. How about Luigi? Luigi goes at location 11, which is one before Mario's, because L comes right before M. So far, so good. Maybe next we insert Peach. Peach goes down here, at the location that corresponds to the first letter of her name, P. Now, if we continue, we're in pretty good shape. We can store Daisy, and we can store Wario and Zelda and Yoshi and all of these other friends of ours in our phone. And if I subsequently want to search for any of these characters, the algorithm I've implemented can simply look at the first letter of that name and jump immediately to that location. There's your constant time, because I have a fixed-length array, size 26, and I've associated, in my algorithm here, numbers with each of those locations, 0 through 25, and because I can certainly look at the first letter of a string of text, someone's name, I can see that letter and I can convert it to a number. For instance, here's where the 65, the 66, actually does come in handy. When I see a name that starts with B, like Birdo here, I know that B in Unicode is the value 66, but I also know it's the second value in my array, so I can essentially subtract 65 from the Unicode value of the first letter of someone's name to immediately get the right location. So here's where it's so wonderfully useful that letters have been standardized as having corresponding decimal values, in both Unicode and, previously, ASCII, because you can do some very simple arithmetic, in so-called constant time, to convert those Unicode values to the corresponding values from 0 to 25 here. So in fact, to insert each and every one of these Nintendo friends takes me constant time. Maybe not one step; it's probably like 2 or 3, because arrays indeed give me this random access, based on their arithmetic indices 0 through 25. So this is great, and this is a hash table. A hash table is essentially an array that you index into, to store values at specific locations. But that's not all a hash table does, because what if I've got more friends than these, and I have to store a second name that starts with L, and a third name that starts with L? Well, where does Link go? Well, I don't want to put Link, for instance, where Luigi was, because then I'd have to delete Luigi just to add Link or anyone else whose name starts with L, which doesn't seem particularly compelling. And so what I've done here, pictorially, is I've proposed that, OK, Link, and really anyone whose name starts with L, belongs at this location 11. Let me store that person, not exactly at that location necessarily, but in a linked list at that location. So, as implemented here, a hash table can be thought of as an array of linked lists, and those linked lists exist for the sole purpose of managing collisions. A collision in a hash table is when two values both hash, so to speak, to the same location. Every L name is going to go to location 11, and unless you're willing to just plop that person somewhere else, seemingly randomly, in memory, or you're willing to delete whoever's there to add that person, you want to probably just start linking them together somehow. So on the one hand, this is a wonderful solution, because it means I can store as many as 3 or more names that start with L. The downside is that this is devolving again.
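Before examining that downside, here's how the whole structure might be declared, as a sketch: an array of 26 pointers, each the head of a possibly empty linked list, reusing the hash function from above. The entry type and its fixed-size name field are just simplifying assumptions for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 26

typedef struct entry
{
    char name[32];       // the data we care about (truncated if longer)
    struct entry *next;  // the chain that strings collisions together
}
entry;

// The hash table itself: 26 linked lists, one per first letter
entry *table[BUCKETS];

// Insert a name: hash to a bucket, then prepend to that bucket's chain
bool insert_name(const char *name)
{
    entry *e = malloc(sizeof(entry));
    if (e == NULL)
    {
        return false;
    }
    strncpy(e->name, name, sizeof(e->name) - 1);
    e->name[sizeof(e->name) - 1] = '\0';

    unsigned int bucket = hash(name);  // 11 for Link, Luigi, et al.
    e->next = table[bucket];           // colliding names just link up
    table[bucket] = e;
    return true;
}
```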
Now my hash table is sort of devolving into a linked list, based on this design, because what about the most perverse case, in which I only have friends whose names start with L? My hash table is going to have 26 locations, only one of which I'm using, location 11, and it's just going to be one massive linked list there, of all of my friends whose names start with L. Not the best design. Now, in this case, that hasn't happened yet, but even in the world of Nintendo, there might be a lot of these so-called collisions, which I can solve by means of these linked lists. But again, my data structure is moving away now from a running time akin to constant time, big O of 1, and starting to approach big O of n. And in fact, if we, as we often do, consider the worst-case scenario, especially when n gets large, in the worst case I only know people whose names start with L, or any one letter of the alphabet, and this is going to devolve entirely into a linked list. So technically, if we analyze the running time of this hash table's implementation, it's technically going to be big O of n. Now, in practice, and let's veer a little away from theory here, in practice, am I really going to be friends with only people whose names start with L? Like, most likely not. I'm probably going to have friends whose names also start with

Segment 12 (55:00 - 60:00)

M and P and B and many letters of the alphabet. Maybe not all; I don't have that many friends whose names start with, in English, Q or X, which tend to statistically be less common. Now, if we assume a uniform distribution of friends, so to speak, to add some statistics to the world of friends, well, it's not going to be quite as bad as that. In fact, we might really have big O of n divided by k, where k is some constant like 26. So ideally, if we have a uniform distribution of friends over the English alphabet, based on the first letter of their names, each of my linked lists, or chains, should be maximally n divided by k long, where n is the total number of people in my phone and k is in fact 26. But again, recall that when we discussed this asymptotic notation, big O and omega, we only really cared about the highest-order terms, not the denominator, or constants, in this case. So really, now I'm sort of nitpicking: big O of n divided by k is really still just big O of n, because if you've got lots and lots of values, it doesn't matter if you divide by 26; there are still going to be a lot of values in each of these linked lists, and therefore they might still be slow. So we're drifting again, it seems, from our constant time. Now, the source of this problem, this performance impact, seems to be the fact that we're getting these collisions. I've got multiple people whose names start with L, and a bunch of other letters of the alphabet. So how can we reduce that probability? Because if I don't have those collisions, I'm not going to have these linked lists, and therefore this performance slowdown. So what if I do something like this: instead of having an array of size 26, what if I use a bigger array and look at not just the first letter of someone's name, but maybe the second and the third? I bet it's less likely that I'm going to have two friends whose names share the same first three letters. Not impossible, but statistically much less likely. So we could put these same 3 friends whose names start with L in the array in that way, and this array is not going to fit on the screen, because it's going to have thousands of boxes now, but you can imagine that this box represents L-A-B, the next L-A-C, then L-A-D, all the way down to maybe L-U-J here, after L-U-I. Now, what's the advantage of this? Well, if I take the time to look not just at the first, but at the first three letters of each of my friends' names, then I can come up with a location in this array that probably doesn't have any collisions, and indeed, these three friends now have their own spots. Well, what's the trade-off? Well, here the trade-off is pretty obvious: to save time and reduce collisions, I'm using a lot more space. Now I'm using 26 times 26 times 26 locations, which is going to give me many more boxes, even though I haven't drawn nearly all of them on the screen; I'm sort of waving my hand with some dots here instead. So if you're willing to spend that much more memory, great, but notice the other trade-off that's implicit here. I don't have any friends, apparently, whose names start with L-A-A and L-A-B and L-A-C, and I can think of other weird combinations of three letters that are just not going to represent human names, at least in English. So I'm wasting a huge amount of space now, whereas I wasn't really wasting nearly as much space when I only had 26 positions. So on the one hand, you can do it; on the other hand, there's a trade-off, and at the end of the day, it's a design decision.
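A wider hash function along those lines might peek at the first three letters; a sketch, assuming every name has at least three alphabetic characters:

```c
#include <ctype.h>

// Hash on the first three letters: 26 * 26 * 26 = 17,576 buckets,
// far fewer collisions, but most buckets (LAA, LAB, ...) sit empty.
unsigned int hash3(const char *name)
{
    unsigned int bucket = 0;
    for (int i = 0; i < 3; i++)
    {
        bucket = bucket * 26 + (toupper((unsigned char) name[i]) - 'A');
    }
    return bucket;  // somewhere from 0 to 17,575
}
```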
It depends on what is more important or, perhaps equivalently, what is less costly for you to implement. All right. But that does give us the sort of holy grail of big O of 1, but not quite. There is still a chance that there might be two or more friends whose names do in fact start with the same first three letters. So maybe hash tables aren't quite where we want to end this tale. In fact, let me show you one other data structure that maybe really, literally gives us constant time access, but we'll see what price we pay. And this is compelling because of this same picture. Recall that when we analyzed the running time of, say, those three algorithms with which we tore a phone book in half, the first two algorithms were these straight lines, because there was a one-to-one or one-to-two relationship between the number of steps and the number of pages in that phone book. My third algorithm, which we now know as binary search, was log base 2 of N, and indeed that was a very different shape. What we're really trying to get to is an algorithm whose running time is constant, represented by K, whereby no matter how big the problem is, the solution takes the same amount of time: K steps, where K might be 1, might be 10, might even be 1000, but it is independent of N, the size of the problem. And indeed, if we reintroduce our big O notation, we can think of this same chart now as really representing two linear algorithms, even though the slope is different. These are really the same algorithm fundamentally, big O of N, as we'll say. This third algorithm here is big O of log N.
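To see why dividing by a constant doesn't change the shape of the curve, here's a quick back-of-the-envelope comparison; the values of N are arbitrary examples, not anything from the lecture.

```python
import math

# How step counts grow for linear, linear-divided-by-a-constant,
# logarithmic, and constant algorithms as N gets large.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"N={n:>13,}  N/26={n // 26:>11,}  "
          f"log2(N)={math.log2(n):>5.1f}  constant=1")
```

The N/26 column still grows right along with N, while the logarithm barely budges and the constant never moves at all.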

Segment 13 (60:00 - 65:00)

It doesn't really matter what the base is here, but this algorithm that we're seeking now, the constant time algorithm, might be said to be in big O of 1. Whatever the constant factor is, 1, 2, 3 steps, 100 steps, 1000 steps, it's a constant number of steps. Can we get to that point? Well, let me propose that we try to get there, no pun intended, though I guess kind of intended, with tries, short for retrieval, even though "trie" isn't pronounced like that part of "retrieval." A trie is another type of tree, but it's not a binary search tree. And here you see yet another mashup of ideas that we've discussed today: a trie shall be a tree made up of arrays. So it seems we have an evolution of data structures here, where people just started mashing together different ideas, and we get these sort of amalgams of first principles. This one, a trie, begins with a root node. That root node I'm not going to draw as abstractly as a single box; I'm going to draw it as 26 boxes, just to convey more clearly that inside of this node is technically an array of size 26, and that array's locations, suffice it to say, are going to represent the letters in the English alphabet, and we're going to use those locations to implicitly store whether or not someone's name, and in turn phone number or email address or anything else, is in this data structure. So for instance, if we alphabetically label each of those boxes just for the sake of discussion, and now highlight T, suppose that we want to insert the first name into this data structure. So I've gone to my phone, I've said add new friend, and I type in the letter T for their name, then O, then A, and then D, because my first friend in my phone, here in the world of Nintendo at least, is going to be the character known as Toad. Now I've drawn this picture a little differently at the very end, such that when I traverse these nodes and in turn these arrays, I'm just kind of indicating with a green box here that the buck stops here. That is to say, if you can reach this point and there are no arrows continuing further, TOAD is a name in this data structure. Suppose I want to insert another name into this data structure. Uh oh, suppose that other friend's name is Toadette. Well, that's fine: if the next friend's name is a superstring, so to speak, a longer version of that same prefix, T-O-A-D-E-T-T-E, we just have to indicate, as with the green box here pictorially, that another name ends here. So what are these green boxes? Well, each might just be an additional bit or maybe an additional byte. We're just using more of that canvas to sort of put a checkmark, conceptually, that means there is a name that stops here in memory. It doesn't really matter, though, for today's purposes how we implement that underneath the hood. So in other words, notice that we are implicitly storing the names Toad and Toadette simply by having one array for each letter in those names, with arrows from the positions that correspond to those letters leading to the array that represents the next letter in that person's name. So what are these arrows? Well, those are just pointers, or addresses, which is to say if this array has size 26, all of these values initially are blank, that is null, 0x0 if you will, but for the location T, what we're going to do is change that initially blank value, or null value, to be the address of this array down here. And once we've typed in the letter O on my phone, I'm going to then update this initially blank value to point
at another array. And now that I'm done saving Toad's name in my phone, another byte is going to be used to indicate that, OK, even if there's an arrow there, a name does stop here. So the mere presence of non-zero values, addresses that lead you to another node in this data structure, indicates implicitly now that the name TOAD is in this data structure, and similarly, Toadette is in this data structure. Now they don't all have to be superstrings or substrings of each other. If I have a third friend, Tom, in this world, well, TOM can similarly be represented in this data structure, and notice here we're reusing some of those same arrays. So on the one hand there's a bit of efficiency here, but on the other hand, even at a glance, with just 3 friends in my phone, you can see an inefficiency too. What's that inefficiency? Well, especially if all of my friends' names start with T, I'm wasting 25 locations in memory, because I have no friends whose names start with A or B or Z or anything in between. So it's a little wasteful to be using these arrays, and frankly, even if I just insert a few more friends into this trie data structure, which again is a tree of arrays,

Segment 14 (65:00 - 70:00)

it's going to get very bulky, with relatively few pointers therein and a lot of null values, a lot of empty locations in these arrays. Now what's the implication of that? Well, on the one hand, we are clearly wasting a lot of space, so I am not loving that. But what's the implication time-wise of all of this? Well, the running time of a trie implemented in this way is said to be constant time, big O of 1. Now why is that? Well, notice, and in your mind's eye imagine there being not 3 names in this tree, but 30, 300, 3000, 3 million. It's going to take a massive amount of space; yes, there are going to be a lot of arrays on the screen with a lot of empty values still, but how many steps does it take you to find Toad in this data structure, even if there are 30 or 300 or 3 million other names? 1, 2, 3, 4. How many steps is it going to take to find Toadette? 1, 2, 3, 4, 5, 6, 7, 8. No matter how many other names, and in turn arrays and nodes, are in this data structure, how many steps to find Tom? 3 steps, and that is independent of the total number of names in the data structure, and that's a powerful idea. You can store as much data in this data structure as you want, and N can be really, really large, but it will always take a constant number of steps to find any individual name. And if realistically we assume that there's some upper bound on the length of a human name, I don't know, it's 8 for Toadette, maybe it's 10 letters, maybe it's 20, and there are some really long names out there if you Google, so maybe it's 200 or something like that, but it is finite. I know of no one in the world, or in the Nintendo world, whose name is infinitely long. It certainly stands to reason that that is a constant, K. And even though I'm quibbling here a little bit, insofar as the maximum length of someone's name is constant, there's an upper bound on it, searching for, inserting into, and deleting from a trie is said to be constant time, that is, big O of 1. We found our holy grail, and it's not going to devolve into a linked list with really long chains, because there's no notion of collisions anymore. We've handled that, even in the case of Toad and Toadette, by having additional markers that keep track implicitly of where a name ends. Now, do you want to use a trie? Frankly, this is kind of a head fake: probably not. They really do tend to use, and therefore waste, quite a bit of space. And so in practice it tends to be better to use, typically, a hash table in some form. Now yes, there tend to be collisions in hash tables, but I chose the most simplistic of hash functions, the decision-making process that determines, given a domain of inputs A through Z, what the range of outputs should be, 0 through 25. If I choose a fancier hash function, maybe looking at the first three letters, or maybe only the first two, or something more sophisticated than that, I bet I can statistically reduce the probability of collisions. Even if there are occasionally some collisions, there aren't going to be a lot of collisions, and indeed, in general, that tends to be appealing when you're designing a data structure: that at least in the average case, the common case, it performs very well. And yes, even though there might be some perverse cases sometimes that are really slow because, darn it, there's a bunch of collisions there, that might very well be OK if those are indeed the exception to the rule and not the rule itself. So that said, let's abstract on top of some of this.
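If it helps to see those pictures as code, here's a minimal sketch of a trie in Python under the same assumptions as the diagrams: each node is an array of 26 child pointers, initially null, plus a marker playing the role of the green box. It's an illustration, not how any particular phone actually implements contacts.

```python
# A minimal trie sketch: each node holds 26 child "pointers" (None, like
# the null 0x0 addresses in the lecture) plus a boolean standing in for
# the green "a name ends here" marker.

class TrieNode:
    def __init__(self):
        self.children = [None] * 26  # one slot per letter A-Z
        self.ends_name = False       # the "green box"

root = TrieNode()

def index(letter):
    return ord(letter.upper()) - ord("A")

def insert(name):
    node = root
    for letter in name:
        i = index(letter)
        if node.children[i] is None:   # blank slot: allocate a new array
            node.children[i] = TrieNode()
        node = node.children[i]        # follow the arrow
    node.ends_name = True              # mark that a name stops here

def search(name):
    node = root
    for letter in name:
        node = node.children[index(letter)]
        if node is None:               # arrow missing: name isn't here
            return False
    return node.ends_name

for friend in ("Toad", "Toadette", "Tom"):
    insert(friend)
print(search("Toadette"))   # True, found in 8 steps regardless of N
print(search("Toadstool"))  # False
```

Note that search touches one node per letter, so its cost depends only on the length of the name, never on how many names are stored.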
It turns out that what we focused on thus far are indeed data structures, ways to stitch together information in your computer's memory by treating it as that sort of canvas, an addressable canvas. We looked at arrays, we looked at linked lists, we looked at trees, specifically binary search trees, and we then looked at hash tables and now tries. But it turns out in the world of computing, there's also what we'll call abstract data types, which are sort of data structures, but they're not necessarily specific data structures. That is, they're more ideas, concepts, features that can be implemented with different data structures. So data structures tend to be among your implementation details, and I meant it when I said you might be working at a whiteboard, drawing a picture with a colleague, trying to decide how you want to lay out this product in memory. Now, as an aside, when you're using programming languages that are popular in industry, odds are you just let the programming language figure out how to store things in memory for you. But nonetheless, part of the brainstorming process very often does involve deciding, well, what data structure should we be using

Segment 15 (70:00 - 75:00)

in some form under the hood. But at some point you want to abstract on top of those low-level implementation details, and you want to consider, well, what functionality do we really care about, what is the business problem we're trying to solve in code and in turn with algorithms and data structures? Let me offer a few examples thereof. A dictionary is a very common thing to implement in code, and by dictionary I might mean literally words and definitions, be it in English or any other language, but it might also more generally just be something associated with something else. Indeed, dictionaries are sort of the Swiss Army knives of abstract data types, in that they let you associate anything with anything else. For instance, you could draw a two-column table and just store the words as a list on the left and the corresponding definitions in the column at right. You could also store, more generally, keys and values, where the key is the thing on the left and the value the thing on the right. So it might be a word and a definition, but it might also be a name and a number, as in a phone book, because if you think about it, a phone book, be it physical or in your phone, is kind of like a dictionary, but instead of words and definitions, it's indeed names and numbers, or names and emails, or whatever nowadays instead. So we could store data like this really on a blackboard, left to right, top to bottom, but how could we implement that in code? A dictionary is just a collection of key-value pairs. How can you implement a dictionary, though, in memory? This is where we have to break the abstraction barrier and really get into the weeds of what's underneath the hood of the computer. Well, I could store keys and values, be they words and definitions or names and numbers, using an array. I could just have a really big array, from left to right, storing all of my friends' names and numbers alphabetically, or all of the words and definitions in a dictionary alphabetically. Maybe that's OK, because it's not that often that you add new words to a dictionary, although I guess the big dictionary manufacturers do that like once a year, so maybe that's not the best design, to use an array. OK, so let's use a linked list. This would allow the dictionary to grow every year, and this would allow me in my phone to add or even remove names and numbers on some periodic basis. So a linked list is better, but wait a minute: linked lists were giving me linear time, not logarithmic, so maybe I want to use a binary search tree, which I could, but that too is only giving me logarithmic time. Can't we use a hash table or a trie? So long story short, even to implement the simplest of ideas, a dictionary with words and definitions, or a phone book equivalently with names and numbers, I can use any of today's data structures, and heck, if we really noodle on this, I bet we could come up with yet other data structures still. So an abstract data type really just offers you functionality without prescribing how it's implemented under the hood, and a dictionary allows you to look up, via keys, corresponding values. Well, what else is there in the world? Well, there are queues, which take different forms. In IT there are printer queues, whereby if a bunch of people are trying to send something to the printer at the same time, ideally those jobs get queued, so that the first person's job comes out first, the second person's second.
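Before going further with queues, here's a sketch of that dictionary idea in Python; the class and method names are hypothetical, and the point is only that the key-value interface stays the same no matter which data structure hides inside.

```python
# A sketch of the dictionary abstract data type: the interface promises
# only key-value operations; the data structure underneath is swappable.

class PhoneBook:
    """Dictionary ADT: associate names (keys) with numbers (values)."""

    def __init__(self):
        # Implementation detail: a hash table (Python's built-in dict).
        # An array, a linked list, a BST, or a trie could hide here instead.
        self._pairs = {}

    def put(self, name, number):
        self._pairs[name] = number

    def get(self, name):
        return self._pairs.get(name)  # None if absent

book = PhoneBook()
book.put("Toad", "555-0100")
print(book.get("Toad"))  # callers never see how the pairs are stored
```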
In the real world, if you're lining up at a store, maybe to buy some new product, ideally the store is maintaining a queue, whereby the first person in line gets to buy the product first, and then the second person, and the third, so there's an order and not chaos there, and there's fairness, if you will. And so a queue, if you really think about it, is what you would technically call a FIFO data type: first in, first out. That is to say, FIFO is a desirable property for a queue, if only for fairness's sake, because I'm going to be really annoyed if I go to the store and I'm there first, but they sort of call on the person way at the end of the line or queue. That's not really a desirable feature of a queue. I want FIFO instead: if I'm first in, I want to be first out. So in the world of queues, too, there are typically operations, and we can use the terms of art here: to enqueue means to add something to a queue, and to dequeue means to remove something from the queue. But the way you enqueue is important. When you enqueue something, it goes at the end of the queue, and when you dequeue something, you remove it from the front of the queue, thereby maintaining that FIFO property. Now how can you implement a queue in a computer's memory? Well, honestly, I probably could just use an array and plop the first person or value here, and then here, and then here, and I just need to make sure that with my corresponding dequeue algorithm I remove this value first, then this one, then this one, maintaining first in, first out order. Now I could use an array to implement enqueuing and dequeuing, but that is going to cap the total number of people or numbers or names that can be in the data structure, so maybe I don't want to use an array. Well, I could use a linked list. That's going to give me a lot more dynamism, but it's not going to make it easy to search for things. But maybe I don't really need to search.
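Here's a minimal sketch of those two operations; using Python's collections.deque as the backing store is my choice, so the queue can grow like a linked list rather than capping out like a fixed-size array, and the printer jobs are made up.

```python
from collections import deque

# A minimal FIFO queue sketch: enqueue at the back, dequeue from the front.
class Queue:
    def __init__(self):
        self._items = deque()  # grows as needed, unlike a fixed-size array

    def enqueue(self, item):
        self._items.append(item)      # new arrivals go to the end

    def dequeue(self):
        return self._items.popleft()  # first in is first out

printer = Queue()
printer.enqueue("Alice's report")
printer.enqueue("Bob's slides")
print(printer.dequeue())  # "Alice's report" -- she was first in line
```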

Segment 16 (75:00 - 79:00)

Indeed, that typically is not among the functions that are provided by a queue. So it really kind of depends what properties I want. Arrays might work, linked lists might work; I don't think I need anything fancier than that. Now an alternative to queues are stacks, which offer a different sort of property, and you see stacks in the real world, maybe in a cafeteria, where there might be a stack of trays whereby you typically take from the top. But that would not be a queue. Why? Well, a queue gives you FIFO, which is first in, first out, and the first tray in, presumably by way of gravity and how it works, would be at the bottom. So first out would mean sort of taking, somehow, the bottommost tray from the stack of trays. That's not, then, a queue. Similarly, in the kitchen, if you've got a stack of plates on the shelf, you're probably not in the habit of taking, first in, first out, the bottommost plate that you put in first. So queues don't really make sense for those purposes. Those are really better described as stacks, both literally and figuratively, because stacks, in the way we've described them, offer a different property: LIFO, last in, first out. And indeed, if you washed a whole bunch of plates and stacked them on top of each other, the last one in is probably going to be the first one out, by nature of your having pulled it from the top. Same with all those trays: the last one in, on the top, is probably going to be the first one out. Now that's a desirable physical property, and a necessary physical property in a lot of cases in the real world, but maybe not a desirable property in the computing world. In fact, your printer would be a little obnoxious if it maintained a stack instead of a queue of jobs, because it would mean whoever sent their job to the printer most recently would get their printout first, and that just doesn't really feel like an equitable outcome there. But there are use cases for stacks, including in computing, at a lower level that we'll perhaps see before long. In terms of some terms of art for stacks, push is the term of art that describes adding something to the stack, even though you're probably not in the habit of pushing your plates or your trays down, and popping, by contrast, is the process of removing something from the stack, which is generally, at least physically, going to be from the top; see the short sketch of push and pop just below. Now, to make these ideas more clear, we'll turn to a friend of ours, Shannon Duvall, who kindly painted the picture of stacks and queues coming to life in ways that will hopefully remind you that, as esoteric as many of today's topics might have been, in reality they are really representative of some real-world realities, be it that stack of trays or plates, be it the line in which you're waiting. In fact, I would propose to you that as you emerge from this week's lecture, focusing on the real world around you, keep an eye out for data structures, or at least data types, and you'll see that even if you're not a computer person, a lot of the ideas we're exploring and trying to distill into these first principles are in fact more familiar than you might think. But first, let's take a look at Jack and his stacks and queues.
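As promised, here is a minimal sketch of push and pop, mirroring the queue above; backing the stack with a Python list is just one illustrative choice, and the plates are hypothetical.

```python
# A minimal LIFO stack sketch: push onto the top, pop from the top.
class Stack:
    def __init__(self):
        self._items = []          # the top of the stack is the list's end

    def push(self, item):
        self._items.append(item)  # new plate goes on top

    def pop(self):
        return self._items.pop()  # last in is first out

plates = Stack()
plates.push("blue plate")
plates.push("red plate")
print(plates.pop())  # "red plate" -- the last one washed comes off first
```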
Yes, said Jack, I sure do. Come to my house and I'll show them to you. So they went off to Jack's, and Jack showed Lou the box where he kept all his shirts and his pants and his socks. Lou said, I see you have all your clothes in a pile. Why don't you wear some others once in a while? Jack said, Well, when I remove clothes and socks, I wash them and put them away in the box. Then comes the next morning and up I hop. I go to the box and get my clothes off the top. Lou quickly realized the problem with Jack. He kept clothes, CDs, and books in a stack. When he reached for something to read or to wear, he chose the top book or underwear. Then when he was done, he would put it right back. Back it would go on top of the stack. I know the solution, said a triumphant Lou. You need to learn to start using a queue. Lou took Jack's clothes and hung them in the closet, and when he had emptied the box, he just tossed it. Then he said, Now, Jack, at the end of the day, put your clothes on the left when you put them away. Then tomorrow morning when you see the sun shine, get your clothes from the right, from the end of the line. Don't you see, said Lou, it will be so nice. You'll wear everything once before you wear something twice. And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself, all thanks to Lou and his wonderful queue. All right, that's it for today. We'll see you next time.
