Hello world. My name is David Malan, and today is all about deploying databases, particularly when we have lots and lots of data that we want to store at scale. Now what do we mean by data? Well, really just information, and that information can be textual in nature or perhaps binary in nature, that is to say, files and such. And at the end of the day we want to store all of this data in a so-called database, which is a collection of data in a computer system. Now databases specifically tend to be actual products that you can download and install on a computer, for instance Oracle and SQL Server, MySQL and Postgres, but there's different types of databases altogether, and along the way today we'll explore exactly what forms those take. But let's consider then, for example, what some other terms might be that we encounter in this world of data more broadly. You might hear of a data warehouse, which is really a database of databases, all of your data somehow combined in one place; a data mart, which is actually just a subset thereof, so this might just be your marketing database, financial database, or something else more narrowly defined. And then you have data lakes, which are sort of hot messes of data whereby you might just dump all of your data in PDFs and files and CSV files and more, generally with the intent of coming back to it later, but it's not necessarily as organized as other formats might be. But today we'll focus indeed on databases, which are organized collections of data, and those collections themselves can take different forms. For instance, perhaps the simplest way of storing a lot of data is inside of a so-called flat-file database, which literally means just storing your data in a file, maybe a text file, maybe a binary file. Now that file might be structured, for instance.
A common format for flat-file databases is the CSV file, for comma-separated values, whereby all of the textual data in your file is somehow organized effectively into rows and columns. The rows are straightforward to implement in a so-called CSV file because you just put one line of data per row, hitting Enter at the end of each of those lines. And then within each of those lines or rows you can indicate the presence of a column, or more specifically an individual cell, by separating the data on that line with, for instance, commas. Now there's alternatives to these. You might be familiar with TSV files, tab-separated values. Sometimes people use vertical bars in order to delineate data within such a file. The choice of character is really up to you, but it does influence exactly what type of software, or how the software, reads that file subsequently. Now if we want to take a look at an example, let's do just that. We've seen in the past some sample data, being like names and numbers, and in fact I brought such an example with me already. Let me go over here to VS Code, and as usual, just to simplify the interface, let me go ahead and hide my activity bar, hide my file explorer, and then down here in my terminal, let's go ahead and open up a file that I made in advance called phonebook.csv, for, again, comma-separated values. What do we see? Well, the first line of this file is curiously a little bit different from all of the others. It's the so-called header row in the file, and you'll see that it says name,number, and that effectively indicates that this file has two columns' worth of data, the first of which represents names, the second of which represents numbers, phone numbers specifically. The comma therefore indicates exactly where one column begins and ends. Below that we have those names and numbers: Brian, David, Doug, Eric, and John, followed by each of their numbers, with commas therein.
Now we got lucky in some sense in this particular data set in that none of the names and none of the numbers naturally have commas themselves, but suffice it to say that if any of your data in a CSV file actually needed to contain actual commas grammatically, that's fine. There's different ways of escaping the data, for instance putting all of the data in one cell, so to speak, in, for instance, double quotes. It just requires that whatever software you're using to open or process these files knows to treat those quotes in such a way that they're not literal quotes but meant to keep together words or phrases that might otherwise have a comma that can confuse that whole process. But this data nicely enough is fairly simplistic, and we have ultimately a whole bunch of names and numbers in this file. But what we don't have is the ability right now to actually read this data in a very user-friendly way. We could certainly write some Python code that reads this file top to bottom, left to right, using something like a loop in Python. But if I wanted to look up my number specifically or John's number specifically, we would essentially have to read the entire file searching for my name or John's, at which point then we have the actual number.
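To make that procedural approach concrete, here is a minimal Python sketch that reads a CSV top to bottom in a loop to look up one person's number. The file contents below are placeholders invented for this sketch, not the actual data in the lecture's phonebook.csv.

```python
import csv

# Recreate a miniature phonebook.csv (placeholder names and numbers,
# not the lecture's actual file).
with open("phonebook.csv", "w", newline="") as f:
    f.write("name,number\n")
    f.write("Brian,+1 617 495 1000\n")
    f.write("David,+1 617 495 1000\n")
    f.write("John,+1 617 555 0100\n")

def lookup(name):
    """Scan the whole file, top to bottom, until a matching name is found."""
    with open("phonebook.csv", newline="") as f:
        for row in csv.DictReader(f):  # first line serves as the header row
            if row["name"] == name:
                return row["number"]
    return None  # reached the end without finding the name

print(lookup("John"))
```

Notice that even this simple lookup has to walk the file line by line until it hits a match; that inconvenience is exactly what a database's query functionality will remove.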
Segment 2 (05:00 - 10:00)
Moreover, if we want to make changes to this file, or additions or deletions, that's a fairly manual process whereby we have to open up the whole file, go in there, and make the changes. In other words, insofar as this is just a flat file, a file alone, there's no inherent functionality offered to me. To get more user-friendly functionality, I dare say we need to transition from a flat-file database to something like a relational database, where a relational database generally refers to software that not only stores your data for you but also provides you with some core functionality. Moreover, that software allows you to relate data within your data set to other data in different ways. Let's consider, for instance, a simple example. Here is a data set that we've seen before: three schools, Harvard, MIT, and Oxford, and the cities that each of them, respectively, is in. It happens to be the case that both Harvard and MIT are in the exact same city of Cambridge, but otherwise Oxford is across the pond in Oxford, UK. But suppose we add another school and city to the mix, namely the University of Cambridge, also in the UK, in the city of Cambridge. Now this city of Cambridge is different from this city of Cambridge. So now we have two problems. We have arguably an inefficiency, insofar as for both Harvard and MIT we're storing Cambridge, which seems at best redundant, having to say the same word Cambridge in two places, per our discussion in the past of just trying to avoid duplication, even though in this case factually it's true. There's also now a problem of disambiguation, whereby this Cambridge is in the UK, these are in the US, but at first glance it's not obvious that those are actually different. Now we could chip away at some of these problems by just
adding, for instance, Cambridge, UK and Cambridge, Massachusetts, USA to be ever more precise, but that still doesn't eliminate the redundancy, and it also doesn't fundamentally ensure long term that we're not going to run into the same problem again. So let me propose that we take a data set like this, which might be stored very simplistically in a CSV file, but begin to take steps to storing it in a relational database instead. For instance, let me propose that we explode this single table of data into two tables initially, one for the schools, one for the cities. Then let me propose that we assign a unique identifier to each of those cities, for instance a very simple number like an integer, starting with the number 1 on up. These IDs could technically be anything, but by convention it's often the case that you just use simple integers starting at 1, moving up toward eventually infinity. Now this in and of itself hasn't really done much for us, because I haven't yet related those cities to these schools. But suppose now I go over to this table of schools, and in addition to their names, I add another column that's going to allow me to keep track of the city in which that school is, but no longer using its name, using that city ID instead. So now I have an indication that both Harvard and MIT are in the same city, AKA number 1; Oxford is in the city AKA number 2; but Cambridge is in the city AKA number 3, thereby disambiguating between the two Cambridges. Now I can, maybe just for thoroughness and for future-proofing in case we encounter more problems like these, add another column to this first table for school ID. Now those numbers can be, and in the real world very often are, the same 1, 2, 3 on up toward infinity, but they're not used for the same purpose. It's perfectly fine for multiple tables all to use very simple integers, even though those integers might indeed overlap.
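To sketch how those two tables and their IDs might actually be stored and queried, here is a small example using Python's built-in sqlite3 module, which we'll meet properly in a moment. The table and column names, and the extra country column used to tell the two Cambridges apart, are my own assumptions for illustration, not a prescribed schema.

```python
import sqlite3

# An in-memory database, so the sketch leaves no files behind.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cities (id INTEGER, name TEXT, country TEXT)")
db.execute("CREATE TABLE schools (id INTEGER, name TEXT, city_id INTEGER)")

db.executemany("INSERT INTO cities VALUES (?, ?, ?)", [
    (1, "Cambridge", "USA"),
    (2, "Oxford", "UK"),
    (3, "Cambridge", "UK"),
])
db.executemany("INSERT INTO schools VALUES (?, ?, ?)", [
    (1, "Harvard", 1),
    (2, "MIT", 1),
    (3, "Oxford", 2),
    (4, "Cambridge", 3),
])

# Let the database "line up the numbers" for us: follow Harvard's
# city_id to the matching row in the cities table.
row = db.execute(
    "SELECT cities.name, cities.country FROM schools, cities "
    "WHERE schools.city_id = cities.id AND schools.name = 'Harvard'"
).fetchone()
print(row)
```

The point is that the database, not the human, does the work of matching a school's city ID to the city's row.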
Context is going to matter, and indeed that is why in this table at left, which I might call my schools table, I have city ID very explicitly stated so that it's not going to be misinterpreted as a school ID. So these 1, 2, 3, 4s are different from these 1, 2, 3s, but both are IDs in the sense that I'm now using them to uniquely identify my data, and thus was born the first of our relational tables inside of, if you will, a relational database. This is a database insofar as it's a collection of data, and it's relational insofar as I have used these unique identifiers to relate some of my data to other data. The downside now is that this does not look particularly user friendly. I took the simplest and most convenient of file formats, that of a CSV, for instance, that had all of the names and numbers right there in front of me. Indeed, we saw that a moment ago in tabular form, and now I've created two tables with these arbitrary numbers that somehow create these associations, but I've just created work for myself, because now I have to minimally sort of line up the numbers and figure out which school is in which city. So wouldn't it be nice if the database itself provided me with a bit of functionality that allowed me to just as easily as before get at the answers I want when it comes to this data? And the answer I might want is, in what city is Harvard,
Segment 3 (10:00 - 15:00)
in what city is Oxford, and so forth. Well, that too is what we typically get in a relational database, which is some inherent functionality that we can leverage to solve problems like those. So what does this look like? Well, we're going to introduce ourselves now to another programming language, this one called SQL, for Structured Query Language, and SQL is typically the language used with relational databases via which to query data, and also update your data and delete data and insert more data and do anything you might want with it. It allows you then, with code, to automate the process of, more generally, reading and writing data. So what does this allow us to do? Well, really four basic operations, and there's this acronym that might help you remember it, CRUD, which is a bit crudely stated, which has four letters in it, each of which represents one of the core features of, for instance, SQL. C stands for create, R stands for read, U stands for update, and D stands for delete. So even though we're about to dive into really a whole new programming language that does have some similarities with languages like Python and even JavaScript, at the end of the day, all it can do is these four fundamental operations. Now these themselves are not necessarily the commands or functions, if you will, that we'll run within the SQL database, but they're quite similar. In fact, we'll see that in the world of SQL, reading data, that is, accessing it and using the data in some form, actually uses a command called SELECT, but CREATE and UPDATE and DELETE are in fact used to do any number of things. INSERT also exists when we want to insert new data, and if we want to not only delete data but drop all of our data, that is, an entire table thereof, we can use a SQL command known as DROP as well.
So even though we're about to take a tour of a brand new programming language, you'll find that SQL is incredibly powerful when it comes to manipulating data, reading and writing it, answering questions about data, data analysis, data science more generally. And indeed SQL is especially powerful insofar as it's a different type of programming language than both Python and JavaScript. Whereas both of those languages were, at least as we saw them, very procedural in nature, whereby if you wanted to use them to solve a problem, you had to tell the computer step by step, using Python or JavaScript lines of code, exactly what to do, exactly how to iterate, exactly what questions or Boolean expressions to ask, SQL is known as a declarative language, where it's closer to using English to just declare what question you have without worrying as much about how to get from that question to the answer. You needn't worry so much about loops, so much about variables, so much about conditions. You can instead say, a little closer to English, exactly what data you would like to select, update, delete, or insert into your database. So let's take a look. Given that we have these commands in front of us, let's go ahead and explore how we might create one such table. Here, for instance, is the first of the commands that we might execute when trying to create a SQL table in a relational database for the very first time. You would literally type out CREATE TABLE, then, per the lowercase here, you would type the name for your table, maybe it's phonebook or something else. Then in parentheses you would specify the name of a column that you want to store therein, for instance name or number, and then some specification of type: is this a number, is this a string of text, or something else? And then you can do that one or more times for additional columns as well. But how do you do this, and where do you do this? Well, you can use any number of database systems.
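Before turning to any particular database system, here is that canonical CREATE TABLE statement in action, executed from Python's built-in sqlite3 module so the sketch is self-contained; the phonebook table name and its two TEXT columns follow the lecture's example.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# The canonical form: CREATE TABLE name (column TYPE, column TYPE, ...)
db.execute("CREATE TABLE phonebook (name TEXT, number TEXT)")

# Confirm the table exists by asking SQLite's own catalog, which stores
# the CREATE TABLE statement verbatim.
schema = db.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'phonebook'"
).fetchone()[0]
print(schema)
```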
Indeed, I mentioned Oracle and SQL Server, MySQL, Postgres, and dozens more, but we're going to use a fairly simple version of SQL for this particular class so that we can focus really on the ideas and not necessarily anything proprietary in nature, and we're going to use a variant of SQL called SQLite, and the command that we'll use to interact with this lightweight version of SQL is itself called sqlite3. Now this isn't to say that SQLite is a toy version of SQL. In fact, it's very commonly used in web browsers and mobile applications alike, but it doesn't necessarily have as many features as some of those commercial and other open-source tools might have that are particularly well suited for the largest sets of data. Now we'll still use some fairly large data sets today, but not necessarily in the millions and billions of rows that you might have out there in the real world. So how can I use sqlite3, for instance, to create my first database and, within it, some table? Well, let me propose that we first create a database as follows. Let me go back to VS Code here. Let me go ahead and close phonebook.csv, but in a moment we'll use that within our own database. I'm going to go ahead and increase the size of my terminal window here in order to see more on the screen, and I'm going to type this command: sqlite3 phonebook.db. Just by convention, I could call this file anything I want,
Segment 4 (15:00 - 20:00)
but .db will indicate to me and other people that it's a database, and indeed I want the data therein to represent that of a phone book. Now how can I actually get some pre-existing data into this database? Well, at the moment there's nothing inside of it. I've just created an empty database, but there's a few commands supported by sqlite3 specifically, and Oracle and SQL Server and MySQL and Postgres and others have comparable commands. I'm going to go ahead and temporarily put my sqlite3 program into CSV mode. I'm then going to import my existing phonebook.csv file into a specific table, and then I'm going to go ahead and quit altogether just to prove that it's worked. All right, so let's do that. Let me go back here to VS Code. Let me go ahead now, and at my SQLite prompt, note my prompt has changed from the usual dollar sign to sqlite and an angle bracket, to indicate that the sqlite3 program is still running. I'm going to go ahead and type .mode csv, and nothing seems to happen, but in general that's a good thing if I see no errors. I'm then going to go ahead and .import the file called phonebook.csv, which recall is in the same folder in which I ran this command a moment ago, and I could call this anything, but naturally I'm going to call the table into which I want to load all of this file's contents quite simply phonebook. And now you'll note, within SQLite's prompt, I'm using these dot commands to indicate that these are specific to sqlite3, but before long I'm going to no longer have those dots in front of my commands, because they're going to be SQL commands from the language itself moving forward. So I'll hit Enter there too. Nothing seems to happen. So let's do .quit to exit out, and I'm back at my prompt. At this point in time, I now have a brand new database file called, because I chose this name, phonebook.db, and I can proceed now to reopen that, for instance, and show you what's inside. So let me again run sqlite3 phonebook
.db and hit Enter. I'm back at my prompt and nothing seems to have happened yet, but if I run .schema now, this is going to show me the so-called schema of my database. The schema of a database is the design of its columns in particular, the names thereof and the types thereof, as well as potentially some additional constraints. So here at my SQLite prompt, if I do .schema, what I'll see is a command that I didn't actually run but that was automatically run for me when I imported that phonebook.csv. This is not necessarily the typical way that you would create a database out of a CSV file, but it's one way. Though in the future, when we know in advance what kind of data and how much of it we're going to have, and it's not necessarily a file we're importing from the start, we're going to go ahead and design the schema for our future tables and then load in data. But in this case I've used that .import command specific to sqlite3 just to get things started more quickly. And you'll see here effectively the line of SQL code that the .import command automatically triggered for me: CREATE TABLE IF NOT EXISTS, just in case I might have run this command before; quote unquote phonebook, which is the name of the table I wanted to create; a column called name, the type of which is TEXT; and another column called number, the type of which is TEXT as well. And all of that, as before, is inside of those parentheses. And then lastly I finish my thought, it would seem, with a semicolon, which in many programming languages, SQL often among them, is how you would terminate a thought, much like a period after an English sentence. Now I'm back at my prompt, but there's no data apparent just yet. All we've seen is the schema, so to speak, of my database, but not the data therein. So how do I get at those names and numbers within this new database? Well, here's the canonical format of a SELECT command, via which you can select, that is, read or access, data in one of these tables.
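As a quick preview of where we're headed, that canonical SELECT can also be tried from a program; here is a minimal sketch using Python's built-in sqlite3 module, with an in-memory table and placeholder data rather than the lecture's actual phonebook.db.

```python
import sqlite3

# A throwaway in-memory table with placeholder rows, just to try SELECT.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE phonebook (name TEXT, number TEXT)")
db.executemany("INSERT INTO phonebook VALUES (?, ?)", [
    ("Brian", "+1 617 495 1000"),
    ("John", "+1 617 555 0100"),
])

# SELECT columns FROM table: ask for exactly the columns you want.
rows = db.execute("SELECT name, number FROM phonebook").fetchall()
for name, number in rows:
    print(name, number)
```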
The command syntax is to say SELECT, then you can specify one or more columns that you want to select from a table, and you can specify the name of that table. Now I've by convention here done this: I've used all capital letters, in uppercase, for anything that's SQL specific. So there are keywords in SQL, much like we saw in Python and JavaScript, among which here are SELECT and FROM. But the columns happen to be my choice of names, as is the name of the table, so just for clarity I've used lowercase there. That's not strictly necessary. SQLite, and in general SQL the language, won't really care what you uppercase with respect to commands, but the column names and the table names, those should be considered case sensitive, so that you are consistent with how you name them from the get-go. So how can I go about selecting data from this first table? Well, let me go back to VS Code here. Let me zoom in a bit so the text is a bit bigger. And now let me go ahead and clear my SQLite prompt so that I can focus entirely on what's new. I'm going to go ahead now and, just per that canonical example, type SELECT. And now I want to select the columns of interest, which will be name,
Segment 5 (20:00 - 25:00)
number, and I want to select those two columns from the table called phonebook, and I'll finish my thought with a semicolon. When I now hit Enter, you'll see, using sort of ASCII art, if you will, the tabular form of the data that was loaded from that CSV, now into this SQLite database, and here we have the column of names and the column of numbers. And notice this perfectly corresponds to what I selected. Now if for some reason I don't care to see the numbers at this point in time, I can do something slightly different. I can just say SELECT name FROM phonebook, and indeed SQL will show me just that. Meanwhile, I can do SELECT number FROM phonebook, and similarly will the database oblige. So already here I seem to have more expressive capabilities than with something like a flat-file database in CSV, whereby even though I could open that CSV either in VS Code or Microsoft Excel, Google Sheets, Apple Numbers, I wouldn't necessarily have this precision with which I can select subsets of data. I could see it with my human eyes, and to be fair, in those spreadsheet programs we do have lots of functions, but we'll see we have functions inside of a SQL database as well. Now it turns out, when you want to just select everything in a database table, if only to wrap your mind around what's inside of that table, there's a common convention, which is to use a wildcard, which is something we've seen in the past, albeit in different contexts. I can do SELECT * FROM phonebook, and no matter what the columns in the phonebook table are called, this is going to go ahead and give me everything, as we saw before. So it's not strictly necessary for me to enumerate each and every column, especially if there's lots of them. But indeed SQL supports a whole bunch of functions as well, and particularly when your data is numeric, you can do things like average things together, find maxima and minima.
If it's textual in nature, you can force things to uppercase or lowercase. You can check for distinct values that are unique, and you can even count the amount of data therein. So how might we use some of these? Well, let me go back to VS Code here again. I'll clear my terminal and propose that if your question is not what are the names herein and what are the numbers, but how many of them are there, we can do this in a number of ways. We can do SELECT COUNT(name) FROM phonebook and then hit Enter there, and we get back 5. Equivalently, I could do SELECT COUNT(number) FROM phonebook, and I should get back the same result. In fact, when using this COUNT function in SQL, it's quite common to be a little more succinct and say SELECT COUNT(*), because you don't really care which columns you're counting, because by definition of how SQL tables work, you're always going to have the same number of cells in every row, even if some of them might end up being blank. So it suffices to say SELECT COUNT(*) FROM phonebook, then a semicolon and Enter, and I'll get back indeed the same response. But suppose now I'd like to know what the distinct numbers are in this phone book, because we've seen before that some of the staff actually share a main number. To do that, I could do initially SELECT number FROM phonebook, but that again is going to return all of the rows, irrespective of duplicates. And so I can apply one of these functions now and say SELECT DISTINCT number FROM phonebook. And as you might guess, I should only now see two, the two unique numbers that are across that phone book. Now it's obviously easy for my human eyes just to count two unique numbers, but suppose this were a larger data set and you were interested in how many distinct values there are, but not necessarily what those distinct values are. Well, you could combine these same ideas, and I could say SELECT
COUNT(DISTINCT number) FROM phonebook, semicolon, Enter, and I get back my answer of 2. So already I seem to have more declarative abilities with this particular language, where I can just select exactly what I want. Now it's not quite as user friendly as using natural language in English, that is, your own natural human words, but there's a relatively simple, and certainly finite, set of keywords that you can use in SQL. These, then, are just some of those. Well, what more can we do with SQL? Well, it turns out we can be even more expressive with keywords like these. I can group related data together. I can search for something like a value, but not necessarily exactly some value. I can limit the results to just the top 10, top 100, or some subset of the total number of rows. I can order the data, that is, sort it by a particular column in ascending or descending order. And then I can have a predicate of sorts, whereby I can say only give me the rows where the following Boolean expression is true. So within SQL we have some of the same ideas that we saw in a language like Python and then JavaScript, but you'll find that by reading it left to right, it's more akin to just asking the computer system for what you want. So for instance,
Segment 6 (25:00 - 30:00)
let's go back here to VS Code, and let's select the number for John specifically. Now if I were to do this in Python or even JavaScript, I would probably open the file, read it from top to bottom, left to right, constantly checking with a conditional and a Boolean expression: is the current name John? And that's fine. That would be a very procedural approach to the problem. But in SQL, I can declare my intent as follows. I can select the number from the phone book where the corresponding name equals, quote unquote, 'John', using single quotes here in SQL by convention, hitting Enter, and I get back just the number I care about. Now conversely, if I want to see all of the names of staff who have the same number in mind, well, I can do SELECT name FROM phonebook WHERE the number in question equals, quote unquote, '+1 617 495 1000', semicolon, and Enter, and now I get back Brian, David, Doug, and Eric, who all share that specific number. Now of course I can just eyeball this and see that four of us have the same number, but if I only care about the count, I could do this. And now to save some keystrokes, I'm going to use a technique much like at my general command prompt, whereby I can scroll back in time with the up arrow and move down, back and forth, to go through my history. If I hit up just once, I get my whole command again. I can go ahead now and scroll here and say, you know what, don't just show me the names, show me the count of the names. So I'm going to insert that function call that we saw before, of COUNT, hit Enter, and now I get back quite simply the answer. But what if I want to see in this phone book the number of times that each number appears in total, so ideally 4 and 1 respectively? How can I do that? Well, you can express that notion in SQL as well. Let me go ahead and select first a number, but also the count thereof, using familiar syntax, from that phone book, but this time group those numbers together.
By saying GROUP BY number, you're telling SQL to essentially look through that column of numbers and group all of the like numbers together, so that you can perform this aggregate function of COUNT on those grouped numbers. So if I hit Enter now, we'll see that, ah, that first number, which is common across several of us, indeed is shared by four names, but the last of those numbers, which is just John's, only has a count, therefore, of one. So again, using syntax like this, we can pretty readily express the questions that we might want to ask about this data set or any other. Well, what more can we do with SQL? Well, it turns out if we want to not only select data, that is, read it, we can also delete it. For instance, if I decide that, you know what, I don't want to be in this phone book anymore, I can delete my own self. Let me go back to VS Code here. I'll clear my prompt and begin now with a new command and delete from the phone book. Where? How do I filter myself out? Well, let's see, I don't want to just delete from the phone book entirely, because if in fact I were to be a little sloppy and, dare I say, hit a semicolon here (don't hit Enter!), that would delete everything from the phone book. Therefore, I want to use a more narrowly defined predicate, a WHERE clause, that specifies that I only want to delete rows where some Boolean expression is true, and I think if I'm seeking to delete myself, I really want to say WHERE name equals, quote unquote, 'David', and then hit Enter. Nothing seems to have happened, but as before, let's do SELECT * FROM phonebook and see what's now in the database. And indeed I am now gone. We see now Brian, Doug, and Eric with those same numbers, and John is there as well, but no David. All right, well, suppose I quickly regret that. How can we put myself back? Well, it turns out in SQL you can not only select and delete but also insert data therein. So here is the canonical format for how you might insert a new row, or even rows, into a table.
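Before turning to INSERT's syntax, everything from SELECT and COUNT through WHERE, GROUP BY, and DELETE so far can be reproduced in a few lines with Python's built-in sqlite3 module. The five names mirror the lecture's phone book, but the numbers are placeholders (four people sharing one number, John with another), since the actual file's contents aren't reproduced here.

```python
import sqlite3

# Placeholder phone book: four people share one number, John has another.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE phonebook (name TEXT, number TEXT)")
db.executemany("INSERT INTO phonebook VALUES (?, ?)", [
    ("Brian", "+1 617 495 1000"),
    ("David", "+1 617 495 1000"),
    ("Doug",  "+1 617 495 1000"),
    ("Eric",  "+1 617 495 1000"),
    ("John",  "+1 617 555 0100"),
])

total = db.execute("SELECT COUNT(*) FROM phonebook").fetchone()[0]
unique = db.execute(
    "SELECT COUNT(DISTINCT number) FROM phonebook"
).fetchone()[0]

# WHERE narrows the rows; GROUP BY aggregates per distinct number.
johns = db.execute(
    "SELECT number FROM phonebook WHERE name = 'John'"
).fetchone()[0]
groups = db.execute(
    "SELECT number, COUNT(name) FROM phonebook GROUP BY number"
).fetchall()

# DELETE with a WHERE clause removes only the matching row(s).
db.execute("DELETE FROM phonebook WHERE name = 'David'")
remaining = [r[0] for r in db.execute("SELECT name FROM phonebook")]

print(total, unique, johns)
print(groups)
print(remaining)
```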
You literally say INSERT INTO and then the name of the table. Then in parentheses you specify one or more columns into which you want to insert new data. Thereafter you say, literally, VALUES, and then in another set of parentheses you literally specify one or more values that corresponds to that number of columns, and that will have the effect, then, of inserting into a new row precisely the values you indicate. So let's see how this might work. Let me go back to VS Code now, and let me insert into that phone book the name of David, as before. Of course, you might realize now I'm forgetting to insert one other value, but that's fine. It's up to me to insert, if I so choose, just a subset of data. That seems to be successful. Let me scroll back in time through my commands and re-execute SELECT * FROM phonebook to now see the result. And indeed we'll now see that David is back,
Segment 7 (30:00 - 35:00)
albeit in a different order, but that's fine, because I never seemed to care about the order previously. But my phone number is null. Null, then, is a special sentinel value, which is distinct from the empty string, that is, the presence of a value that just happens to be blank. Null means we never even put anything there, and it's a so-called sentinel value that's useful because it indicates the deliberate, or even accidental but glaring, omission of data. It's not just that the user left something blank; it's that your command, your SQL statement, didn't insert anything at all. So how can we resolve this? Well, thankfully there's the U in CRUD, for update, via which we can not only select and delete and insert but also update our table. The syntax here is a little bit different, but it's to say UPDATE and then the name of the table, then SET, followed by a whole bunch of key-value pairs, as in the past: the name of the column that you want to update and the value to give it, and, if you so choose, a comma-separated list of additional columns and values. In this case I'll do just one, but I want to specify where this update should apply. If I were to omit the WHERE clause and just end my thought with a semicolon, this would update the entire table, setting every value in that column to the same value. So how might I do this more carefully? Well, let's go ahead and do UPDATE phonebook SET the number to, quote unquote, '+1 617 495 1000', WHERE and only where the name is, quote unquote, 'David'. If I had omitted that WHERE clause here, what would happen? Well, I don't think it would really affect Brian, Doug, or Eric, but it would affect John, insofar as his number was different but would no longer be if I didn't specify that this update should only apply where the name is indeed David. Now nothing seems to have happened, but indeed that tends to be good. Let me scroll back in time and SELECT *, and now we see that I'm back in the phone book.
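The insert-then-update sequence just described can be sketched end to end as follows, again using Python's built-in sqlite3 module with placeholder data; the number is the shared one from the lecture.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE phonebook (name TEXT, number TEXT)")

# INSERT only a subset of columns: the omitted number column becomes NULL.
db.execute("INSERT INTO phonebook (name) VALUES ('David')")
missing = db.execute(
    "SELECT number FROM phonebook WHERE name = 'David'"
).fetchone()[0]
print(missing)  # None, Python's representation of SQL's NULL

# UPDATE ... SET ... WHERE fixes it; omit the WHERE and EVERY row changes.
db.execute(
    "UPDATE phonebook SET number = '+1 617 495 1000' WHERE name = 'David'"
)
fixed = db.execute(
    "SELECT number FROM phonebook WHERE name = 'David'"
).fetchone()[0]
print(fixed)
```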
Now, if you don't like this ordering for some reason, or the whole point of this database is to show an alphabetical list of your contacts, well, that too we can address as before. Let me go ahead and do select star from phonebook, but let's order this by the name column. Now you'll see that everyone's name is in alphabetical order as before, and the numbers still correspond appropriately. If, though, you wanted to do this not in the default order, which happens to be ascending, ASC for short, but descending, DESC for short, you can do the same, and now the order of the names is entirely flipped. You can apply that ordering to any column in some table. All right, let's transition now from our own tiny phone book to an actual real-world database that comes to us from IMDb, the Internet Movie Database, which has a whole lot of information on actors, TV shows, movies, and more. Within this database, you can imagine the need to store data in a fairly careful fashion, because if you have thousands, maybe even millions of rows, you're going to want to store things as efficiently as possible. So how might you go about storing, for instance, the names of some TV stars in a database like this? Well, we could, for instance, for a TV show like The Office, put the name of the show in the very first column, and then thereafter, in each subsequent column, we could put the name of a star in the show. So we have star, star, and star, at least for those who received top billing at the start of this particular series. And then for any other show we could add another row, and then each of their main stars as well. But no matter what show might come to mind, you can imagine the number of stars in it not necessarily being five.
So some of those subsequent rows might have fewer columns or more columns, simply depending on the number of TV stars therein, and that should rub you the wrong way, because, if nothing else, visually the data is going to become very ragged, whereby none of the edges, so to speak, will line up. That in and of itself isn't so much a problem, but it does suggest a sort of sloppiness whereby your data is not all the same, so this tends not to be the best solution. And indeed, in general, in relational databases, if you're in the habit of adding more and more columns, you're probably doing something wrong, unless the schema for your data has very deliberately, fundamentally changed. It's much more the common case to add more and more rows to a data set in a relational table, and so we should aspire to do something more like that. Now, we could, for instance, restructure things as follows into a two-column table, thereby assuaging my concern, but now we have sort of a new problem that we've seen before in other places, whereby I've got lots of duplication of The Office, which, if nothing else, is just wasting space now by storing literally the same name of a show multiple times.
Segment 8 (35:00 - 40:00)
And as in the past, it might invite mistakes at some point, because if I change the capitalization or spelling of some show's name, I might not remember to make that change in all five places. So this is just asking for trouble. So how might we do this a little bit better? Well, let's propose that we do it as a relational table and introduce, as before, some of those unique identifiers. Let me propose, for instance, that we have one table here for the shows themselves, in this case just one, but you can imagine having more and more rows for more and more shows. In that table we'll have a title column for the title of the show, but we'll also start to have an ID as well. And rather than start at one, I've consciously chosen 386676, which happens to be the unique identifier in the actual Internet Movie Database for this version of The Office. Meanwhile, we'll call this new table shows, which makes perfect sense. Meanwhile, let's have another table, for instance, over here that has the names of all of those TV stars, but let's associate with each of them a unique identifier as well. And again, rather than start at 1, 2, 3, 4, and so forth, I'm using the actual identifiers from IMDb. For instance, Steve Carell's is 136797, arbitrary but consistently the case throughout their entire data set. Now, unfortunately, at the moment, at a glance, there's nothing linking this data together, nothing relating one table to another. But even here, I think we can solve this, not necessarily by adding a new column to these shows or now these people, but in some cases you can even use a third table that somehow relates one to the other. And in fact, I'll propose here in the middle that we could have a third and final table, maybe called stars, that creates that relation across these two tables, the first column of which, I'll say, is show ID, which corresponds to the ID column in the shows table, and the second column of which is person ID, which corresponds to the ID column in the people table.
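These three tables can be sketched with Python's built-in sqlite3 module. The IDs below are the real IMDb identifiers cited above; the pared-down column lists are assumptions for illustration, and the query at the end just uses select and where, which we've already seen, to follow the stars table from shows to people.

```python
import sqlite3

# Minimal versions of the three proposed tables: shows, people, and,
# relating the two, stars.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE shows  (id INTEGER, title TEXT);
CREATE TABLE people (id INTEGER, name TEXT);
CREATE TABLE stars  (show_id INTEGER, person_id INTEGER);
""")
db.execute("INSERT INTO shows VALUES (386676, 'The Office')")
db.execute("INSERT INTO people VALUES (136797, 'Steve Carell')")
db.execute("INSERT INTO stars VALUES (386676, 136797)")

# The stars table is what relates one table to the other: its show_id
# lines up with shows.id, and its person_id lines up with people.id.
row = db.execute(
    "SELECT title, name FROM stars, shows, people "
    "WHERE stars.show_id = shows.id AND stars.person_id = people.id"
).fetchone()
print(row)
```

The point is that neither shows nor people mentions the other; the intermediate table alone encodes the relationship.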
Now, this is just one convention, and different people, different companies might have different ways, but it's not uncommon to name your tables in the plural form of these words, to then name the unique identifiers in most of the tables quite simply ID in all lowercase, and then, in an intermediate table like this, whose purpose in life is to relate one table to another, to use the singular form of the table followed by an underscore followed by ID. But again, capitalization and spelling conventions might differ; this is just one way that I'll use uniformly as we explore this particular data set. Now, unfortunately, here too, even though you might now see that, oh, the stars table links this particular show, 386676, with this particular person, 136797, it's a lot more work, certainly for me as a human with my human eyes, to now link these pieces of data together. I can see, of course, that this ID here corresponds to this show here and this person here, but it was a lot more convenient before, when it was all right together in the same table. So on the one hand, we've sort of addressed some of the problems we identified, of redundancy, perhaps ambiguity, and the like, but we've introduced new problems in that the data is now all over the place. But this too is a problem that SQL can solve for us. Now, just to be clear, let's consider exactly what data is in common across these three tables. Highlighted here is the same show ID in the shows table as well as in stars, indicating that all of the corresponding person IDs are somehow associated with that show. Meanwhile, between the stars table and the people table, we have the same IDs pairwise in each of the same, indicating that each of those people, by way of the stars table, is indeed associated with that show. Meanwhile, there could, of course, be more stars and more shows, so we'll indicate as much with ellipses.
But let's take a step back now and look at the entirety of our version of the IMDb database, which we indeed imported for you and for class into these relational tables. Here, then, is a diagram that represents the data we're about to explore. You'll see some familiar titles here now, like shows and people, as well as stars, but it turns out there are more tables in this database, like writers and ratings and genres. Let's focus for now, though, on just two of these, and consider how we might implement the simplest, if you will, of relations across just two of these tables before we then allow things to escalate into that relationship across three total tables. So pictured here as shows is the table that has, we'll soon see, an ID, a title, a year, and some episodes. So this isn't quite as simple as our imagined version of IMDb, which just had the stars and the names of the shows, but here we have, associated with each show, a unique ID, the title thereof, the year in which it debuted, and the total
Segment 9 (40:00 - 45:00)
number of episodes. Meanwhile, over here in another table are the corresponding ratings for that show, which will map back to the shows table by way of a show ID, a rating on a numeric scale, as well as the total number of human votes that contributed to that rating. Now, strictly speaking, we could have put all of these ratings inside of the shows table by just adding, for instance, a rating column and a votes column, but in this particular case, the two pieces of data came from two different sources, so it's not unreasonable to claim that this table will relate to this one specifically in a one-to-one relationship. That is, every row on the left will correspond to one row on the right, and that's something we can ultimately enforce at the database level. So what do we mean by this? Well, consider that the shows table's ID column, the unique identifier for each show, will naturally line up with the show ID column in the ratings table. Now let's go back to VS Code here, clear my SQLite prompt, and quit out of sqlite3 altogether. Now, instead of opening up phonebook.db, which was the database we ourselves created using phonebook.csv as an import, let's go ahead and open up a different database that I brought with us based on IMDb's own data. And to do that, I'm going to run sqlite3 as before, but this time specify shows.db, which is a much larger database that's been made in advance. That brings me back to a SQLite prompt. I'll go ahead and clear the same and now run select star from shows, but I don't think I want to see all of the shows at once, because there are indeed going to be thousands. So let's go ahead and limit the result to just the top 10, the first 10. If I go ahead and hit Enter, we'll see here the first 10 shows in the actual Internet Movie Database, including each one's ID, its title, the year in which it debuted, and the total number of episodes.
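The limit technique here, and the count of star we'll use in a moment, can be sketched with Python's built-in sqlite3 module. The twenty stand-in rows below are invented purely for illustration, since the real shows.db isn't reproduced here.

```python
import sqlite3

# A tiny stand-in for the shows table, populated with 20 invented rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER, title TEXT, year NUMERIC, episodes INTEGER)")
db.executemany("INSERT INTO shows VALUES (?, ?, ?, ?)",
               [(i, f"Show {i}", 2000 + i, 10) for i in range(1, 21)])

# LIMIT caps the result set so the screen doesn't overflow with rows.
first_ten = db.execute("SELECT * FROM shows LIMIT 10").fetchall()

# COUNT(*) asks the database itself to tally the rows, rather than
# printing them all and counting by hand.
(total,) = db.execute("SELECT COUNT(*) FROM shows").fetchone()
print(len(first_ten), total)
```

Selecting star with a small limit, then counting, is exactly the quick-and-dirty reconnaissance described here: first see the shape of the data, then its size.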
Meanwhile, let's wrap our minds around the other table, called ratings, as follows: select star from ratings, limit 10. And I dare say, whenever opening a SQL database for the very first time, just selecting star from one or more tables is a useful, if quick and dirty, way to wrap your mind around the actual data therein. And limiting the results again just ensures that your screen's not going to overflow with more data than you care about; it's just going to give me a sense of the format of the data. And indeed, here we see three columns in this table: show ID, rating, and votes. So how can I now maybe see just how large each of these tables is? Well, I could certainly run the entire command and just wait and wait for all the rows to come out, but let me instead clear my screen here and then do select count of star from shows, and we'll see in this particular database we have some 245,200 shows. Meanwhile, there are probably nearly as many ratings, assuming that all of those shows have been rated. How, though, can I see how the actual tables themselves have been implemented? Well, for that, recall, we can run .schema, and .schema is going to show me exactly how each of these tables has been defined. And let me scroll back to the top of this command, and we'll see, top to bottom, these tables here: we have genres and people, ratings and shows, and down here, stars as well. Let's focus on the few that we've talked about thus far. So at some point or other, I or someone else ran the command create table shows, then in parentheses specified a comma-separated list of all of the columns that we want to create and the corresponding types thereof. We'll see now that ID is specifically integer, and that lends itself to being a number like 123 or much larger. We'll see that title is text, but moreover, it's not null.
In fact, this is another feature of a proper database, whereby, unlike in a CSV, where you can put anything or nothing if you want, in a database you can specify that users should not be able to insert or update values that contain null, so that you ensure that every show in this table is in fact going to have a title. Thereafter we see that year is numeric, which seems to be another type altogether, and then episodes itself, which makes sense, is an integer. Moreover, we then see something mentioning primary key, but we'll come back to that in just a bit. Meanwhile, for people here, not surprisingly, we have ID, which is an integer; name, which is text and not null; birth, which is presumably their birth year, which is numeric; and then again some mention of primary key on ID. But lastly, let's focus here now on ratings, which relates somehow to these shows. We have here a show ID, which is an integer that's not null but is unique, and that indeed ensures that we'll have a one-to-one relationship ultimately. The rating is real, which means a floating-point value, something typically with a decimal point, that similarly cannot be null. The number of votes is an integer that, too, is not null. But notice here some mention of a foreign key referencing shows:
Segment 10 (45:00 - 50:00)
a foreign key on show ID referencing the shows table's ID, and we'll come back to all of these keys in just a bit. So now that we have that in front of us, let's take a quick look at the actual data types available to us. We saw in Python that we had things like integers as well as strings, or strs for short. In SQLite specifically, you have access to these primary types. Now, in other databases like Oracle and SQL Server, MySQL and Postgres, you have an even longer list of available types that allow you to impose further constraints and formatting on your data. SQLite, though, insofar as it's a bit lightweight, allows us these five choices here. Now, integer speaks for itself. Text is just text. Real is a floating-point value. Numeric is something that is like a number, typically a year, a date or time, or something else that generally falls into some standardized numeric format. And blob, or binary large object, really just means a binary value of some sort, maybe even a file in the database, though files generally belong in the file system, stored in a folder somewhere. So that's why we've seen not just text but integer and other types in this file. We can see that there are other constraints we've imposed via keywords like not null, which ensures that the database won't even let you omit some value, and unique, which ensures that the database won't let you insert duplicate values. But let's now revisit those keywords that we saw in that schema for IMDb's database, namely primary key and foreign key. We've been using the first of these, and really the second, without calling them by name thus far. A primary key is a column in a table that uniquely identifies each of its rows, typically with some simple integral value like 123 or something much larger than that, but a unique value that uniquely identifies every row in that table.
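Those not null and unique constraints can be seen in action with a short sketch in Python's sqlite3 module: the database itself refuses offending inserts by raising an IntegrityError. The tiny table here is an assumption for illustration.

```python
import sqlite3

# A two-column table with both constraints from the lecture: a UNIQUE id
# and a NOT NULL title.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER UNIQUE, title TEXT NOT NULL)")
db.execute("INSERT INTO shows VALUES (1, 'The Office')")

# The database won't even let you omit a value the schema says must exist.
try:
    db.execute("INSERT INTO shows VALUES (2, NULL)")
except sqlite3.IntegrityError as e:
    null_error = str(e)

# Nor will it let you insert a duplicate into a UNIQUE column.
try:
    db.execute("INSERT INTO shows VALUES (1, 'Duplicate')")
except sqlite3.IntegrityError as e:
    unique_error = str(e)

print(null_error)
print(unique_error)
```

In other words, these constraints are enforced at the database level, not left to the goodwill of whatever code happens to be inserting data.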
A foreign key is the presence of a primary key in another table, and it's these foreign keys that enable us to create these relations across tables. For instance, as before, when we had our shows table, which had an ID, we then had a stars table, which related to it via show ID. In the first of those it was a primary key; in the second, a so-called foreign key. Similarly, the people table had an ID uniquely identifying each person, and in the stars table we had person ID, which in that context was a foreign key referencing the people table's primary key. So the use of these terms is just context-dependent, but they refer ultimately to these unique identifiers that not only uniquely identify data but allow us to cross-reference it in other places. And so in the schema here in VS Code, we see the ability in the database itself to specify that, for instance, within the ratings table, insofar as there is a show ID column, that show ID is in fact a foreign key that references this other table's primary key. And we can see this elsewhere as well. If I scroll down to the bottom of the schema and reveal the create syntax for the stars table, we'll see that the stars table, as we saw pictorially, has a show ID column and a person ID column, both of which are naturally integers and not null. But there are now two foreign keys in this table: namely, show ID references the shows table's primary key, ID, and the person ID column references the people table's primary key, ID, as well. So it's verbose, and not necessarily syntax that you might want to memorize, and indeed there's software out there in the world that allows you to click some buttons and generate commands like these, which are not necessarily easy to remember, but ultimately reading these things line by line does specify and make clear to you exactly what your columns are and what each of the types therein is.
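A foreign key, too, can be enforced at the database level, and here's a minimal sketch of that using Python's sqlite3 module. One subtlety worth knowing: SQLite only enforces foreign keys on a connection after PRAGMA foreign_keys = ON has been executed; without it, the constraint is recorded in the schema but not checked. The sample rows are assumptions, apart from The Office's real ID.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only when this is on
db.executescript("""
CREATE TABLE shows (id INTEGER, title TEXT NOT NULL, PRIMARY KEY(id));
CREATE TABLE ratings (
    show_id INTEGER NOT NULL UNIQUE,
    rating REAL NOT NULL,
    FOREIGN KEY(show_id) REFERENCES shows(id)
);
""")
db.execute("INSERT INTO shows VALUES (386676, 'The Office')")
db.execute("INSERT INTO ratings VALUES (386676, 9.0)")   # fine: show 386676 exists

# A rating for a nonexistent show violates the foreign key and is rejected.
accepted = True
try:
    db.execute("INSERT INTO ratings VALUES (999999, 5.0)")
except sqlite3.IntegrityError:
    accepted = False
print(accepted)
```

So the database itself guarantees that every row of ratings points at a real row of shows, which is precisely what makes the cross-referencing trustworthy.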
But there's another command in SQL that's even more powerful, perhaps, and allows us to somehow link back to the original material we care about. That is to say, thus far, in trying to impose these relationships across tables, we've created all of these new identifiers, namely these primary and corresponding foreign keys, and to the human eye now it's a lot more work to try to line up all those values and figure out what data relates to what. But in SQL there's indeed an actual keyword called join that allows you to specify how you might take one table here, one table here, and join them together, so that you actually have full-fledged access to all of the data you care about, as opposed to it being in multiple places. So how might we use this pictorially? Well, here, then, are snippets of two tables: namely, we have our shows table here and our ratings table here, and there's certainly more data in each of these tables, but for now we'll focus just on The Office. For The Office, recall, we have a unique ID of 386676,
Segment 11 (50:00 - 55:00)
and it stands to reason, then, insofar as I claim there's a one-to-one relationship between the shows table and the ratings table, that there's going to be a row, hopefully, in the ratings table corresponding to show ID 386676. Now, for the purpose of this story, I don't care what the rating is, but suffice it to say it is there. But these pieces of data are in two different places. Wouldn't it be nice if we just had one table with not only the show's title but also its rating, much like we would have if we used a darn spreadsheet or something like a CSV file? Well, let me go ahead and propose that, just for the sake of discussion, we flip around these columns, just so that you can see things lining up more visually clearly, although the computer certainly doesn't care in what order the columns are. Let me propose that we now highlight the values in common, the primary key here and the corresponding foreign key here. Let me propose, just for the sake of discussion, that we nudge these two tables together to make clear that these are in fact the same values, so we might as well omit the duplicated one and demonstrate that we could somehow join together these two tables so as to get one new temporary table, if you will, containing the title and ID and rating that we care about. And in fact, I probably don't even care about the ID itself, the primary key. That's useful for uniquely identifying the data, but as a human, at the end of the day, the question I presumably asked is, what is The Office's rating? So this is the end result I care about, as we've painted this picture on screen. How can we actually do something similar in code? Let me go back to VS Code, clear my prompt, and execute the following: select star from shows, join ratings, so as to join those two tables specifically. But how do I want to join them? On the shows table's ID column lining up with the ratings table's show ID column, like that.
Now, I could go ahead and hit Enter now, and I would see the joining of those two tables on those particular columns. But let's be more precise. Let's specify that I only care about this, for now, where the show's ID happens to equal 386676, semicolon. And what I should see now on the screen is the composition of the joining of those two tables, such that I have ID, title, year, and episodes from the first, and show ID, rating, and votes from the second, for The Office specifically. Now, if I don't care about all of those values, why, I shouldn't be using the wildcard. So let me go back in my history, scroll to the start of my statement, and select not star but, for instance, just title and rating. My command's a bit long, so it'll wrap onto two lines, but I'm going to go ahead now and just hit Enter, and now I will see that The Office, specifically that Office, has a rating of 9.0. As an aside, there are multiple versions of The Office, including the one in the UK, but the fact that I've specified this specific unique ID gives me the one that debuted in the US here in 2005. Now, of course, this is only useful insofar as I know the unique identifier of The Office being 386676, which is not going to be the norm, and certainly not something I would type into IMDb.com in our own user interface, but for now it does demonstrate how we can, using a SQL join query, combine these here two tables. If I wanted to do all of them, of course, I could rewind in time and just exclude that predicate altogether, and executing this query would give me a big list of all of the shows and corresponding ratings, top to bottom. All right, well, let's focus on another type of relation now altogether.
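The join just demonstrated can be sketched in Python's sqlite3 module against a one-row stand-in for each table. The 9.0 rating and the 386676 ID come from the lecture itself; the episode and vote counts are invented for illustration.

```python
import sqlite3

# Stand-ins for the shows and ratings tables, joined one-to-one on
# shows.id = ratings.show_id.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE shows (id INTEGER, title TEXT, year NUMERIC, episodes INTEGER);
CREATE TABLE ratings (show_id INTEGER, rating REAL, votes INTEGER);
""")
db.execute("INSERT INTO shows VALUES (386676, 'The Office', 2005, 188)")   # episodes assumed
db.execute("INSERT INTO ratings VALUES (386676, 9.0, 500000)")             # votes assumed

# Select only the columns we actually care about, rather than star.
row = db.execute(
    "SELECT title, rating FROM shows "
    "JOIN ratings ON shows.id = ratings.show_id "
    "WHERE shows.id = 386676"
).fetchone()
print(row)
```

The result is the one-row temporary table painted on screen a moment ago: just the title and the rating, with the duplicated key columns left out.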
Besides the one-to-one relationship that we apparently have between shows and their ratings, it turns out that there's another type of relation manifest in this same data set: between shows and genres, we have a one-to-many relationship instead. And it stands to reason, because a show could be both a comedy and a musical, or a drama and a thriller, or any number of genres could be associated with some show. So it wouldn't really make intuitive sense to require a one-to-one relationship, which would mandate that every show have just one genre. One-to-many allows for just this: one show can have many genres. Well, how might this play out? Well, let me propose that with a one-to-many relationship, we can see data like this. Here on the left we have another excerpt from the shows table, for Catweazle, and on the right we have an excerpt from the genres table. Meanwhile, notice that the ID for Catweazle here is 63881, and the show ID that we have excerpted here on the screen is exactly that, again and again, because Catweazle apparently is an adventure, a comedy, and a family show.
Segment 12 (55:00 - 60:00)
But how could we answer a question like, well, what genres does Catweazle belong to? Well, here we can see at a glance what they are, but how might we join the data together to get at this answer as well? Well, let me propose that we take a look at the schema for genres by going back to VS Code here and clearing my screen. Let me go ahead and do .schema genres to see specifically how it was implemented. Create table genres, genres being the table's name, and then in parentheses we have a show ID, which is an integer that's not null; we have a genre, which is just text and not null; and then a foreign key on that show ID column that references the shows table's ID column. Now, as an aside, this table isn't really as rigorously implemented as it could be, because the mere fact that genre is text would seem to imply that we might have Adventure, Adventure, Adventure in multiple places, in duplicate. So what we seem to have stopped short of with this particular version of IMDb is further normalizing the data set and making sure that we also have unique identifiers for genres, which we do not have in this case. So we stopped short of designing this perhaps as well as we might, and it's now clear that there's still room for improvement. But for now, let's take the design of this table for granted, and let's poke around for the same data therein. So at my SQLite prompt, let me select genre from genres where the show ID I care about is indeed 63881, semicolon, and that gives me back, just as we saw on the screen, Adventure, Comedy, and Family. Meanwhile, though, that's of course not the most user-friendly command, just as it was not particularly user-friendly to get the ratings for The Office by having to know its unique show ID. So let's do better now and moving forward. It turns out that you can not only execute individual SQL commands, but you can nest them as well.
So I could use one command to figure out what the unique ID is for Catweazle and then embed the result from that query into the actual question I care about, which is, what are its genres? So I might do this: select genre from genres where the show ID I care about is not a number I know or remember, but I know it equals the following: select ID from shows where the title I care about equals Catweazle, close parenthesis, semicolon. So now, much like in grade-school math, anything inside the parentheses will be executed by the database server first; then the answer will come back, hopefully the number I care about, which is 63881, and that will effectively be substituted here, so that the show ID I search for in the genres table will be exactly that number. All right, let me go ahead now and hit Enter, crossing my fingers that this will work, and indeed I get back the same answer. But in this case, I, as the human, the programmer, did not need to know or care what that unique ID is. Now, imagine going to any website where you're searching for information, maybe IMDb.com itself. No normal human is going to type in 63881 to find out information about Catweazle. What they're probably going to type in is Catweazle. So presumably, if you think about a website like IMDb.com, what they're probably doing is a SQL query much more like this, whereby in that text box in the web form they're grabbing the user's input, presumably Catweazle in this case, and then constructing a SQL query like this in their own code base on their server that effectively copy-pastes the user input here, and maybe scrubs it so there's no danger, and then ensures that they can get back, no matter the ID, the corresponding genres for Catweazle. So here we have, again, an example of a one-to-many relationship that allows us to get back the data we care about.
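This nested query can be sketched in Python's sqlite3 module. The 63881 ID and the three genres match the lecture's example; the tiny tables are otherwise assumptions. Note too the ? placeholder, which is how the scrubbing just alluded to is typically done: the user's input is passed as a parameter rather than pasted into the SQL string itself.

```python
import sqlite3

# Stand-ins for the shows and genres tables from the lecture's example.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE shows (id INTEGER, title TEXT);
CREATE TABLE genres (show_id INTEGER, genre TEXT);
""")
db.execute("INSERT INTO shows VALUES (63881, 'Catweazle')")
db.executemany("INSERT INTO genres VALUES (63881, ?)",
               [("Adventure",), ("Comedy",), ("Family",)])

# Like grade-school math, the parenthesized inner query runs first,
# producing the ID that the outer query then searches for. The ?
# placeholder safely injects the (hypothetical) user's input.
user_input = "Catweazle"
genres = [g for (g,) in db.execute(
    "SELECT genre FROM genres WHERE show_id = "
    "(SELECT id FROM shows WHERE title = ?)", (user_input,))]
print(genres)
```

The human, or the code, never needs to know the number 63881 at all; the database resolves it on the fly.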
But there's another approach we could take for this one-to-many relationship that's more like the one-to-one relationship we explored earlier. Rather than use this nested query, I could alternatively use a join as well. And in fact, just as in programming more broadly, there are often multiple ways to solve the same problem. So let's see that in this particular case. Let me propose that we treat these examples as follows. On the left here is an excerpt from the shows table, and on the right here is an excerpt from the genres table as well. Just as before, in a way that the computer won't care about, let's, just for discussion's sake, flip these two columns around to make clear that Catweazle's ID is indeed going to line up with these show IDs here. And in fact, if we highlight the same and then nudge these two tables together, you'll notice that this table does not actually have the same number of rows and columns everywhere.
Segment 13 (60:00 - 65:00)
So in fact, if you were to join these two tables together as we did before, by pairing up the ID column here as the primary key and show ID here as the foreign key, you're actually going to necessarily need some duplication of data so that every row has every value you care about. So in fact, if you were to join these two tables together, the result set you would get back would actually look a bit like this. Now, we can eliminate some of this redundancy by getting rid of the duplicate foreign key, but you'll see here that the title, even if we now get rid of the primary key, is duplicated again and again. And this is not necessarily something we can avoid if we indeed want to join the two tables together. We can avoid it by taking that different tack, by actually using the nested query, but let's just see this in action, because it's perfectly valid, and it's actually quite common to join two tables together even if you might get back, temporarily, some duplication, if only because it makes it easier to get the title and genre, and title and genre, iteratively, all from the same result set. So let me go back to VS Code here, clear my prompt, and let me propose that initially we'll select star from shows. We'll join shows with genres, specifically on the shows table's ID equaling the genres table's show ID. And now, rather than end our thought and see every show and every genre associated with it, let's just narrow the scope, not by limiting to some number of results, but specifically limiting ourselves to 63881, just because we know that it's Catweazle. Let me hit Enter, and we'll indeed get back a much larger table with all of the columns, because I used star, and indeed we see that same duplication. Now, I can whittle that down if I only care about, at the end of the day, these genres for Catweazle, and I can hit up in my history, go back over here, replace the star with just genre, and now we're back to an answer akin to the nested query as before.
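This one-to-many join, with its expected duplication, can be sketched in Python's sqlite3 module using the same assumed stand-in tables as before.

```python
import sqlite3

# Stand-ins for shows and genres; one show, three genre rows.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE shows (id INTEGER, title TEXT);
CREATE TABLE genres (show_id INTEGER, genre TEXT);
""")
db.execute("INSERT INTO shows VALUES (63881, 'Catweazle')")
db.executemany("INSERT INTO genres VALUES (63881, ?)",
               [("Adventure",), ("Comedy",), ("Family",)])

# Because the relationship is one-to-many, the title is repeated once
# per matching genre row in the joined result set.
rows = db.execute(
    "SELECT title, genre FROM shows "
    "JOIN genres ON shows.id = genres.show_id "
    "WHERE shows.id = 63881").fetchall()
print(rows)   # the title appears in every row
```

The duplication is temporary, existing only in the result set, not on disk, and is often a fair price for getting title and genre together in each row.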
So again, this is only to demonstrate that we can solve the same problem in multiple ways, but the question we cared about, and the answer we wanted, was initially, what genres is Catweazle in? And that we've now had in a couple of ways. Well, how about one more type of relation, by revisiting our larger data set? We've seen thus far one-to-one between shows and ratings. We saw just a moment ago one-to-many between shows and genres. Let's now come full circle to when we looked at the TV stars and the shows that they're in, by focusing lastly on a many-to-many relationship, namely across these three tables herein. Recall that shows is a table with ID, title, year, and episodes. Recall that people has an ID, name, and birth year for each person. Notice, though, to be clear, there's no mention of shows in the people table, and there's no mention of people in the shows table. For that, we had that third intermediate table, which now we'll call a join table, that has two columns in common: show ID with the shows table and person ID with the people table. And that third table is what relates one to the other, shows to people and people to shows. But of course, in the real world, it stands to reason that one person could be in multiple TV shows, and it stands to reason that one show has multiple people, ergo a many-to-many relationship. So when you have this type of relationship, which is, dare I say, more involved than any of the others, how can we get at answers to questions we might care about, whether it's me sitting behind a desk analyzing some data or me as a user sitting on IMDb.com just typing in queries? Well, let me go back to VS Code here and clear my prompt, and let's go ahead and just poke around this data initially. Let me go ahead and select star from shows where the title of the show I care about is, quote unquote, The Office, semicolon. Well, we'll see that there are in fact several versions of The Office over the years, but we'll focus for now on the US version here, with Steve Carell and team.
In fact, let me clarify, then, that the Office I care about not only has a title of The Office but also has a year of 2005. And because 2005 is a numeric value, I don't need the quotes; it's not a string of text per se. And so this should now narrow the results to the same version of The Office we've been talking about. Now, suppose that I wanted to get all of the people who star in this show; that is to say, how can I explore that many-to-many relationship? Well, all of the people in this data set exist in the people table. All of the shows in this data set exist in the shows table. But the linkage between them, recall, is the stars table. So I kind of need to follow some breadcrumbs from one table to the next
Segment 14 (65:00 - 70:00)
in order to get back the data I actually care about. Suppose now that I want to get all of the people who starred in this particular Office. How can I get from the shows table to the stars table to the people table, ultimately? Well, let me propose that we select first only the ID of the show that we care about, so that I don't have to remember 386676. Let me repeat the same query, but instead of selecting star, let's specifically select ID, which gives me just that. Now, using this show ID, can I get all of the person IDs associated with it? Well, I can if I use the stars table, which links precisely those columns together. Let me execute select person ID from the stars table where the show ID I care about equals 386676. But again, I don't want to have to memorize or care about a number like that. So let me parenthetically copy and paste the previous query, and for clarity, I'll go ahead and put this on a new line. You'll see that sqlite3 gives me a "...>" continuation prompt to indicate that I'm still in the middle of a statement. So let me parenthesize this fully and then say semicolon on the outside. So when I hit Enter now, what I get back, rather cryptically, are all of the person IDs that line up with this show ID in the stars table. But this too is not what I care about. I care about the actual people, and specifically their names. That's OK. We've already gone from the shows table to the stars table by way of the show ID lining up with the person ID, from the former to the latter. How now can I go to the people table using these person IDs? Well, I don't think that's that hard. I can nest those two queries together as follows: select the name of the people whose ID is, not equal to, but in this whole set of person IDs. Now, I could go through and start copying and pasting all of these person IDs, but I don't want to do that. Instead, let me parenthesize exactly that query as before, select person ID from stars where show ID equals, and then, in parentheses, let me put that in:
Select ID from shows where title equals 'The Office' and year equals 2005. Let me close parenthesis, close parenthesis, semicolon. And what you'll now see, step by step, is that on the outermost part of this statement I'm selecting the name from the people table where the ID of that person is in the list of person IDs that comes from this second line here, from the stars table; but the show ID that I look up in the stars table comes from this final query, whereby I select the ID of the show that I care about, which at the end of the day is The Office from 2005. And if I hit Enter now, if you've watched this show, you should see some familiar names, not just the lead cast but even more of the TV stars from that same show, per IMDb. So it's a bit of a mouthful, and a bit of multiple steps to get there, but as you get better and better with SQL, much as you might with Python or JavaScript, these kinds of commands come to you much more readily, and you can save these commands, or cache them in some form, so that you can answer the same questions again and again, not necessarily about this particular show but even some others. In fact, let's now, perhaps lastly for this many-to-many relationship, flip things around and ask for all of the shows that starred Steve Carell specifically, following the many-to-many relationship in the other direction. How might I do that? Well, at the end of the day I'm probably going to need Steve Carell's ID. I don't know what that is offhand, so let's ask the database: select ID from people where the name of the person I care about is Steve Carell. I'll go ahead and run this command, but that's just going to give me his ID, 136797. I don't really care about the specific number, though, so let me go back and parenthesize this query, because what I really care about are all of the show IDs that correspond to that person ID in the stars table. So let me do that: select show ID
from stars where the corresponding person ID equals, specifically, Steve Carell's ID. Then let me go ahead and execute that query, and I get back now a whole list of show IDs. Well, let me now put everything together, and ultimately select the title from shows, which is what I actually care about. So: select title from shows where the ID of the show is in the following query, select show ID from stars where person ID equals, and then the very first query, select ID from people where name equals, quote unquote, Steve Carell, close paren, close paren, semicolon.
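That fully nested query can be sketched with Python's built-in sqlite3 module against a tiny in-memory stand-in for the IMDb tables. This is a minimal illustration, not the lecture's actual database: the schema and the IDs for The Office and Steve Carell come from the walkthrough above, but the second star and his ID are made up here.

```python
import sqlite3

# Tiny in-memory stand-in for the IMDb-style tables from the lecture.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO shows VALUES (386676, 'The Office', 2005);
    -- Steve Carell's ID is from the lecture; the second star is invented.
    INSERT INTO people VALUES (136797, 'Steve Carell'), (1, 'Rainn Wilson');
    INSERT INTO stars VALUES (386676, 136797), (386676, 1);
""")

# Innermost query finds the show's ID; the middle query maps it to person
# IDs via stars; the outermost query maps those person IDs to names.
rows = db.execute("""
    SELECT name FROM people WHERE id IN (
        SELECT person_id FROM stars WHERE show_id = (
            SELECT id FROM shows WHERE title = 'The Office' AND year = 2005
        )
    );
""").fetchall()
print(sorted(name for (name,) in rows))  # → ['Rainn Wilson', 'Steve Carell']
```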
Segment 15 (70:00 - 75:00)
And what I should now see is not a whole list of opaque identifiers for each of those shows, but indeed all of the shows, according to IMDb, that Steve Carell has been in, among them The Office. But that's not the only way we can do this, and just to show you that there are other ways too, I'll do this one a little more succinctly, right off the bat: you can also use SQL joins to answer questions like this. For instance, if I know I ultimately want to join together people and shows by way of those stars, I could express myself as follows, alternatively: select title from shows, join the stars table on shows.id equaling stars.show_id, joining further the people table on stars.person_id equaling people.id, where the name I ultimately care about is Steve Carell, semicolon. So it's a different approach to the problem, whereby I'm essentially joining all three tables together and then filtering them with this predicate, where name equals Steve Carell. And that again gives me, albeit more slowly, the shows he's been in. Now, there's a third and final way I could execute this, using really an implicit join, joining the tables together without, interestingly, using the join keyword explicitly, and I can do that as follows: select title from shows, stars, and people as a comma-separated list of table names, where the shows table's ID column equals the stars table's show ID column, and the people table's ID column equals the stars table's person ID column, and the name I care about equals, quote unquote, Steve Carell. So notice that by way of this predicate I'm implicitly joining the shows table with the stars and people tables, and then additionally filtering by Steve Carell's name. In this case now, when I hit Enter, after a pause, indeed a bit more slowly, I get that same result: the titles of all of Steve Carell's shows. Now, of course, there's even more data in this database. We've looked at people and stars and shows.
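Those two styles of join can be sketched side by side against a tiny in-memory stand-in for the same tables (only the one show and star are included here, so this is illustrative rather than the lecture's actual data):

```python
import sqlite3

# Minimal stand-in tables, using the IDs from the lecture's walkthrough.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stars (show_id INTEGER, person_id INTEGER);
    INSERT INTO shows VALUES (386676, 'The Office');
    INSERT INTO people VALUES (136797, 'Steve Carell');
    INSERT INTO stars VALUES (386676, 136797);
""")

# Explicit joins, using the JOIN ... ON keywords.
explicit = db.execute("""
    SELECT title FROM shows
    JOIN stars ON shows.id = stars.show_id
    JOIN people ON stars.person_id = people.id
    WHERE people.name = 'Steve Carell';
""").fetchall()

# Implicit join: a comma-separated list of tables, linked in the predicate.
implicit = db.execute("""
    SELECT title FROM shows, stars, people
    WHERE shows.id = stars.show_id
      AND people.id = stars.person_id
      AND people.name = 'Steve Carell';
""").fetchall()

assert explicit == implicit
print(explicit)  # → [('The Office',)]
```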
We've not yet looked at genres or ratings, but there's also this other table called writers, and it turns out the writers table has a relationship with shows and people quite like that of stars. In fact, the column names are the same, namely show ID and person ID, but the table's name captures the relationship, that of writers. Now here too is maybe a missed opportunity to improve this data set further, because it would seem that we have the TV stars relationship captured here and the writers relationship captured there, but what if there are other roles on a show's set? What if there are producers and cinematographers and yet other roles in the industry? It would seem that we need a separate table for each of those roles, each of which is structured exactly the same and just has a different name. Odds are, if we spent a bit more time, I bet we could factor out that commonality and avoid having stars and writers and even more tables, and perhaps more generically have a roles table that somehow captures the association of people with shows. All right, we saw a moment ago that not all queries perform the same, and that indeed depends on some of the underlying implementation details of the database itself, particularly if we have not indexed our database tables. That is to say, there's yet another feature of relational databases whereby, if you know that you're going to be performing certain queries quite commonly, you might want to optimize the underlying representation of that data by creating indexes on your tables. And an index allows you to use syntax like this in SQLite: create index, then the name of an index, on a specific table, specifically on, in parentheses, a comma-separated list of one or more columns. When you create an index in this way, what the database essentially does for you is build up in its memory a B-tree data structure. Now, this is not short for binary tree; a B-tree is a bit different.
It looks a little something like this: a fairly wide tree that might have many leaf nodes ultimately, but those leaf nodes are as close to the root as possible, giving us, ideally, some kind of logarithmic running time. That is to say, by building this data structure in memory for columns that you know you're going to search on a lot, you can avoid having the database resort to something like linear search, searching the entire column from top to bottom for a name like Steve Carell or a title like Catweazle; rather, by storing that data somehow in the database's memory in this tree structure, you can get to that data more quickly than with something like linear search alone. In fact, let's see the effects. Let me go back to VS Code here, and let me turn on a feature of SQLite called a
Segment 16 (75:00 - 80:00)
timer, which will allow me to time how long each of my subsequent commands takes to execute. Let me now do a query like this: select star from shows where the title I care about is, as before, The Office, of which there are several. Notice now that I get back all of those results, but in particular I see just how much time it ultimately took. In real time it took 0.038 seconds, which isn't all that long. But if you imagine a larger data set, and you imagine a popular application like IMDb.com itself, you can imagine even this 0.038 seconds adding up over time. Now, this has very real-world implications for usability. The longer a query takes, the longer your users are going to have to wait, and the more users you have, the more servers you're going to need in order to support lots of users if each of those queries is taking a good amount of time. So in other words, if we can shave even milliseconds off of this time, odds are that's going to save our users time, and me money, because I don't need as many servers to support that same number of users. So what I could do, for instance, if I know that my users are frequently going to search on a column like title, is create an index in advance on precisely that column. In fact, let me now create an index, called anything I want; perhaps I'll call it title_index, to make clear what it's used for, on the table called shows, specifically on the title column. Notice now, when I hit Enter, that query did take a moment, almost half a second, 0.357, but that's an operation I'm only going to have to execute once.
To be fair, there is a cost to the database over time: if I continue to insert and update and delete data in this table, it might need to make modifications to that index, and, for instance, the underlying B-tree. But odds are, if I'm generally reading data, selecting data from this database, as I would if I have lots of users, the reads are going to be much more common than the writes, so the cost of creating an index, and the maintenance thereof, can surely be amortized over time. Now let's execute that first query yet again: select star from shows where title equals, quote unquote, The Office. Recall that a moment ago, before I created the index, that query took 0.038 seconds, which didn't sound like much. But now that we have that index, and the underlying data structure in memory, voila, we're down to essentially 0 seconds. We don't even have enough significant digits here on the screen to capture just how fast that new version is; we're at 0.000, which surely sounds fast. So: better response times for users, less money for me, fewer servers in our cluster. We're probably now in better shape simply by creating these indexes. Now, I wouldn't necessarily want to do that for all of my columns and all of my tables, because as with anything, there's ultimately going to be a trade-off. These trees in memory surely start to take up additional space, and for tables that are updated frequently, an index might actually slow down my inserts, my updates, my deletions. But certainly for the most common of queries, like searching on a website or application, I probably can choose to optimize some of those columns in some of those tables. Now, in fact, I could go back and perhaps engineer a few more indexes on the columns and tables my previous query touched when looking for Steve Carell, which would, I dare say, expedite just how long it takes to join some of that data together.
Now, up until now we've been focusing entirely on SQL, which, again, in the context of relational databases, has me storing all of my data in rows and columns. That often works well but sometimes breaks down, or at least makes it much more difficult to store data, especially if that data is hierarchical in nature. Indeed, an alternative approach to storing data might be a technique known as NoSQL, for "not SQL," or really "not only SQL," which stores, rather than rows and columns, what we might call documents or objects instead. For instance, if you would prefer not to use rows and columns, and indeed SQL itself, but would instead just like to keep all of your related data together, as from IMDb, you could store similar information in a document like this, in a format that you might recognize as JSON, JavaScript Object Notation, which is used in NoSQL contexts as well. And here you have ultimately the same data that we've been exploring in IMDb, if this is an excerpt therefrom, whereby we have a unique ID, which by convention in this particular form is underscore ID as the key, followed by a colon, which indicates here comes a value, the value of which is in this case 386676, which of course is the unique ID that we've seen for The Office. Meanwhile, if using a NoSQL database, and indeed this particular example thereof,
Segment 17 (80:00 - 85:00)
you might have another key called title, the value of which is, not surprisingly, The Office. And then, if you want to keep all of the related data together, as opposed to putting it somewhere else and requiring a language like SQL to join everything back together, it is commonly done in the NoSQL world to bundle everything up in one document or object, such that another key here is stars, the value of which, as indicated by square brackets, is an array or list of additional objects. Indeed, these curly braces here indicate an object, or dictionary, as we've seen in prior contexts, whereby each of these objects represents a person who has two keys and values: an ID, like 136797 for the name Steve Carell, and so forth. So the appeal of NoSQL is often that you can just grab all of the data you care about right away, but you don't necessarily benefit from some of the same features we've seen thus far, some of the constraints that a SQL database can impose, like uniqueness, and guaranteeing that some foreign key references a primary key from one table to another; nor do you necessarily have features like the ability to add indexes to find very specific information quite as efficiently; and indeed there might be redundancy too if you're storing some of these objects not as references but as the raw objects themselves. But it's an alternative approach that's quite popular as well, and it's distinct from SQL in that the data is no longer in rows and columns but in a more hierarchical structure that can be nested instead. But when it comes to databases, no matter what format you use, there are some fundamental problems and risks that even today so many companies and programmers encounter. One of the first of those is what might be called race conditions, which rear their head in all forms of computing. But in the context of databases, what can go wrong might be this.
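Stepping back for a moment, the NoSQL-style document just described might look like the following JSON. The show's and Steve Carell's IDs come from the examples above, but the second star and his ID are made up for illustration:

```python
import json

# Roughly the document described above: a show and its stars bundled
# together, rather than spread across tables that must be joined.
document = json.loads("""
{
    "_id": 386676,
    "title": "The Office",
    "stars": [
        {"id": 136797, "name": "Steve Carell"},
        {"id": 1, "name": "Rainn Wilson"}
    ]
}
""")

# With everything nested in one document, no join is needed to list stars.
names = [star["name"] for star in document["stars"]]
print(names)  # → ['Steve Carell', 'Rainn Wilson']
```

Now, back to what can go wrong at scale.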
If you have, for instance, a very popular website, like that for social media, where users can click on an icon in order to like a post, well, for a server implementing that social media network, you might have some code that ultimately uses SQL like this, whereby, when someone clicks on the like button, the database of course needs to update its records to say, oh, we have one more like in our system. Well, how might you figure out what that new count should be? There might be an underlying query like: select the current number of likes from the posts table, representing the social media posts, where the ID of the post in question, for the sake of discussion, is 13. Assume now that value is stored somewhere in memory, and a second SQL query is now necessary on the server in order to update the current number of likes to the new number of likes, that is, plus 1. So, for instance, if the current number of likes was 50 at that moment in time, we of course want to update the posts table, setting the number of likes to 51, where the ID of the post to update is 13. That's fine and, at a glance, correct. But what race conditions allude to is the fact that even well-intentioned commands like these might lead to unexpected results, especially at scale, if you've got not only a lot of data in your database but a lot of users using your application or website. Imagine that these two lines of code are being executed in parallel on lots of different servers, even in the simplest case where you have two servers to handle twice as many users. Suppose that two users out there in the world both click the like button on the same post at essentially the same time. And because it takes a little bit of time for their requests to go from their browsers to your servers, imagine that on one server this first command is executed and the answer comes back, indeed, as 50.
But imagine that before that one server has a chance, maybe it's a bit slower, to execute the second query, the other user's like hits your other server, which runs the same line of code and gets back the same answer, 50. At this point in time, both servers have in memory the number 50, in which case they'll both add 1 to 50, get 51, and both servers will thereafter proceed to execute: update posts, set likes equal to 51, where ID equals 13. Where is the bug? Well, because both of them inspected the value of likes in the database at essentially the same moment in time, and then made a decision based on that value, you're going to end up storing 51 in the database instead of the correct answer, which should be 52. So this race condition refers to really two competing requests racing to finish but getting interwoven temporally, in terms of time, accidentally. And that is because, fundamentally, these two queries can be interrupted, or at least interleaved with others, because they're running in different places on different servers but ultimately accessing, somehow, the same database. You can think of this metaphorically in the real world, as I
Segment 18 (85:00 - 90:00)
was taught years ago by one of my own professors, who encouraged us to consider a scenario in which you live with someone like a roommate, and you have a refrigerator in which you like to store milk. One of you comes home one day, opens the refrigerator, and sees, ah, we're out of milk. So you close the fridge and head out to the store to go buy another jug of milk. Meanwhile, your roommate comes home, who also likes milk, opens the fridge, realizes, oh, we're out of milk, closes the fridge, and heads out the door, maybe in another direction, to go get some milk. Of course, this story ends with both of you coming home and putting not one but two jugs of milk in the fridge, which was not the desired result. And why did that happen? Because both of you made a decision based on the state of the world, the refrigerator, without somehow blocking the other from making a decision at the same time. How might you have avoided that problem? Well, you could leave a note on the fridge saying something like "gone for milk," or you could lock the fridge physically. And funny enough, that's essentially the solution to this same problem in databases, whereby, if there are two or more queries that you want to execute together, so to speak atomically, such that they both execute or neither of them executes, with no interruption in between, you can introduce what are known in the world of databases as locks, or transactions, whereby, instead of executing just those two lines of code, you first tell the database to begin the following transaction, then execute these commands, and then commit your changes before someone else can do the same. And effectively these transactions prevent two different queries from being interwoven together.
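That begin-then-commit pattern can be sketched with Python's built-in sqlite3 module; this is a minimal illustration, not the lecture's own code, with the posts table and its values mirroring the likes example above:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode, so we can issue
# BEGIN and COMMIT ourselves and manage the transaction explicitly.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER)")
db.execute("INSERT INTO posts VALUES (13, 50)")

db.execute("BEGIN")  # start the transaction: the read and write act as one unit
likes = db.execute("SELECT likes FROM posts WHERE id = 13").fetchone()[0]
db.execute("UPDATE posts SET likes = ? WHERE id = 13", (likes + 1,))
db.execute("COMMIT")  # atomically publish both steps together

new_likes = db.execute("SELECT likes FROM posts WHERE id = 13").fetchone()[0]
print(new_likes)  # → 51
```

For this particular case there's an even simpler fix: a single atomic statement, UPDATE posts SET likes = likes + 1 WHERE id = 13, which never reads a stale value into application memory at all.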
That's exactly what might have happened with the likes on a post or the milk in the fridge. And if something does go wrong mid-transaction, you typically get the feature of rolling back, whereby at least one of those servers can say, uh-uh, I'm just going to show the user an error message and roll back, without trying to change the database myself. But there's one other problem we should consider in the world of databases, specifically SQL, known as SQL injection attacks. It is quite common to use SQL and some relational database, like SQLite or Oracle or SQL Server, MySQL or Postgres, or something else, to store all of the data for some website or application. Of course, websites and applications often have text boxes of some sort that allow you, the user, to type in queries, maybe to search for some data, or to update or delete or insert some data into that database. The problem begins, though, if the programmer who wrote the SQL queries trusts that user input and assumes that users are not going to do something malicious, like sneaking in DELETE or DROP or other potentially dangerous commands. And it turns out, in some cases, if your code is poorly written, be it in SQL or also Python or some other language, you can accidentally allow users to inject code into your system. Now, what do I mean by this? Well, consider, for instance, a login form that might look like this, asking for the user's username and password, for instance, email and password here. Suppose, though, that the programmer who implemented the authentication underlying this website didn't know about SQL injection attacks and is all too willing to trust users' input. And suppose I, in this story, am a malicious adversary, and I suspect that as a programmer you're in the habit of using single quotes, and I suspect that you're in the habit of using SQL databases.
Well, a single quote is certainly something I can type on my own keyboard, for instance, curiously, not at the start of my input but at the end. And dashes, which we've not seen before in SQL, are how you begin a so-called comment, a note to self that is not meant to be executed but essentially blocks out everything to the right and tells the database to just ignore it. So, curiously, I've typed in my email address, malan@harvard.edu, but I've finished that thought with a single quote and two dashes. Well, let's see where things might go awry. Suppose on the back-end server is the code the programmer has written, which somehow uses a SQL query like this: select star from a table called users where the username is whatever the user typed into the form and the password is whatever password the user typed into the form. Notice that I've preemptively put single quotes around the username and the password, and much like in Python, even though this is meant to be pseudocode here, the curly braces just mean plug in the username there, plug in the password there, but quote them, because they're strings, not something like numbers. Well, what can happen if a malicious malan@harvard.edu passes in that input is that this SQL statement might naively become: where username equals, quote unquote, malan@harvard.edu, close quote, which is the quote that I typed in, followed by the dashes that tell the database to ignore everything else.
Segment 19 (90:00 - 92:00)
That is, don't even worry about what the user's password is. And if you assume that this query is ultimately being used to figure out whether we do have a user with username malan@harvard.edu, and therefore whether they should be allowed to log in, suffice it to say this query, if you essentially trick the database into ignoring the entire password check, could very well allow an adversary, whether it's me, malan@harvard.edu, or someone pretending to be me, to log into the server without even knowing the password, thereby injecting SQL into the database. In this case, it's just as though none of that password check were even there, and so the user might get in. Now, what's the solution? It turns out there's a fairly simple solution, and it stems from not trusting user input. Indeed, when writing code, you should generally code defensively: assume that users will make mistakes, or at worst will be malicious, and don't trust what they've typed in. But that's fine. There exist plenty of libraries, and in turn functions, that allow you to format your SQL queries, when writing code on a server, to look more like this, where you very deliberately use placeholders, which in many libraries are simply question marks, and leave it to someone else's code, the library you're using, to figure out how to insert the email address someone's typed into this placeholder, and the password someone's typed into that placeholder, and let the library worry about quotes and escaping and generally sanitizing the user's input, so to speak. There are well-defined ways to do it; there's sort of a checklist you might go through, but if you are new to programming, or just not versed in this attack, you might miss one of those important details. Best to stand on the shoulders of others and use very popular libraries, perhaps open-source tools, so that your code is much safer as a result. But the lesson, ultimately, is: do not trust users' input.
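The placeholder approach just described can be sketched with Python's sqlite3 module, whose placeholders are exactly those question marks. The users table, its one row, and the login helper are all made up for illustration:

```python
import sqlite3

# A toy users table; the credentials here are purely illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('malan@harvard.edu', 'secret')")

def login(username, password):
    """Return True if a matching user exists, using placeholders so the
    library quotes and escapes the input; the values are passed separately
    and are never spliced into the SQL string itself."""
    row = db.execute(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        (username, password),
    ).fetchone()
    return row is not None

assert login("malan@harvard.edu", "secret") is True
# The quote-and-dashes trick is now treated as literal data, not SQL:
assert login("malan@harvard.edu'--", "anything") is False
print("placeholders keep user input as data, not code")
```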
Now, on that downer of a note, let's emphasize that ultimately our goal has been to store data at scale, moving from the simplest of files, text files like CSVs, to proper relational databases, or heck, even NoSQL databases. In the former paradigm, you just store all of your data in rows and columns, and thanks to features like indexes, not to mention all of the commands that SQL supports, you can navigate that data by selecting and inserting and updating and deleting. And whether you're using SQLite or indeed Oracle, SQL Server, MySQL, Postgres, or any number of others, you can scale your websites and applications well beyond the hundreds and thousands and hundreds of thousands of rows that we've seen here, to millions if not billions as well, with these very same tools.