Hello world. My name is David Malan, and today is all about securing systems. Now what do we mean by security, or cybersecurity in particular? Well, generally it refers to keeping our systems safe from harm, from theft, from intrusion, but I dare say it's helpful to think about a couple of primitives via which we can begin to secure our systems, the first of which we might call authentication, the process of proving who you are, as by logging in with a username and password. But authentication alone is not necessarily sufficient for keeping a system secure, because just because you can demonstrate who you are and have an account on a system doesn't mean you should be able to do anything and everything on that system, for instance deleting files or accessing data that maybe you're not authorized to access. In fact, another primitive to consider is just that, authorization, some form of access control that specifies, once you have authenticated and proved who you are, what resources you should and should not actually have access to. Now, of course, in the real world it's very much the case that you and I are authenticating all of the time, and how are we doing this? Well, generally by way of passwords, and in fact odds are you have tens or hundreds of passwords nowadays, not to mention usernames, but it turns out we humans are not very good at choosing good passwords. In fact, it's commonly the case that passwords are discovered, as by hackers getting into databases or somehow finding access to usernames and passwords, and those passwords then leak out onto the public internet, where people can download and abuse them, of course. Well, fortunately there are at least some security researchers out there, the good guys, so to speak, who can also mine that data for lessons learned. And among the lessons we can learn is just how common certain passwords actually are.
And in fact we can come up with effectively a top 10 list of passwords we shouldn't use as a result. Over recent years, such passwords as these have been discovered by security researchers to be among the most common, which is not in fact a good thing. For instance, among those 10 is 123456, which certainly doesn't take much effort to come up with, but kind of suggests what the policy is on the systems that are allowing that password, namely they seem to require passwords of length at least 6. Also common is 123456789, which is marginally more secure insofar as it suggests that the systems on which that password was found at least have a larger length requirement for passwords, for instance 9. But also on the list are passwords like password, literally, which is hitting the nail a little too much on the head there, but password, of course, is just an English word, so probably not all that hard for someone to guess. password1 suggests a bit more complexity because we've actually thrown a number into place. QWERTY, of course, is yet another option, but for those unfamiliar, what does that signify? Well, if you happen to have a US English keyboard in front of you and look at the top several characters, you'll see that the keys on the keyboard spell in fact QWERTY. So a strange-looking word, but a word on the keyboard nonetheless. QWERTY123. Now this one's a little interesting insofar as it suggests that systems on which this password's been found at least require some alternation between lowercase and uppercase letters and maybe the use of some numbers as well, so marginally better than QWERTY alone. 111 isn't all that compelling, but slightly better is 123123. So at least there is less repetition, but still a pattern in that one.
iloveyou is both adorable and insecure because even though it's three words, they're very commonly put together in someone's password, and then there's another one which you might think is actually pretty good: P@ssw0rd, with a capital P, an at symbol, and a zero. Well, unfortunately, if you are thinking that you're pretty clever by substituting other characters that kind of look like letters of the English alphabet, like the at sign for A and the 0 for O, well, so do the adversaries out there. The bad guys on the internet, the hackers that try to get into your systems, know that people like you and me might use those heuristics, and so that's going to be among the first things they themselves try when trying to get into your or someone's account. Indeed, what is the lesson learned really from the top 10 list here? Well, suffice it to say that you do not want to be on this list, because if your password is among those here, any smart adversary when trying to get into some system is probably not going to try some random passwords to get in. They're going to download, like I did, the top 10 list, the top 100 list, the top 1000 list, and start with those passwords, if only because probabilistically those passwords are going to get them into some system sooner
Segment 2 (05:00 - 10:00)
than any other passwords they might actually guess. Now while you might see that there are some password policies manifest in here, the reality is that there's far too much predictability in all of these. We have the numbers here, which are not all that hard to generate, certainly on the keyboard or even with a program that hacks into some system. We have English words, and English words appended with some numbers that you could certainly tack on very easily. And in short, even something like this iloveyou, which is three words mashed together, well, all three of those words are in the dictionary, and any smart adversary, for instance, waging a so-called dictionary attack, downloading a really big dictionary of lots of English words and trying to brute force their way through all possible passwords, is certainly going to eventually exhaust those words and then maybe start concatenating, joining together, two English words and trying all possibilities, then 3 English words and trying all possibilities, not to mention the fact that this is already on this here top 10 list. So there are some lessons learned even in these, and so if you do already have passwords that are either on this list or all too reminiscent of passwords on this list, it's time to change them for reasons that we'll now soon see. So what are brute force attacks? The name sort of conjures up, from yesteryear, the idea of using a big battering ram to try to brute force your way through a castle door, just trying really, really hard, not doing anything particularly sophisticated, but trying again and again until you can enter that system or, in that case, that castle. A brute force attack digitally nowadays really refers to using some kind of software or even hardware to automate the process. For instance, even if you don't know what the password might be, you could certainly try all possibilities.
If the password's at least 4 digits long, for instance, you can try all possibilities of 4 digits, and you can work your way up from there. In fact, let's consider then how secure a system would be if protected by only a 4 digit passcode. And this is common nowadays on cell phones, for instance, where you might minimally have a passcode that is at least 4 digits long. Now that might seem like a lot, and that might seem certainly better than nothing, and it probably is, but just how secure is it? Can we perhaps start to slap some numbers on the question of how secure a system, or phone in this case, is by considering the actual constraints on those codes? So for instance, if you are required to have a 4 digit passcode and those passcodes are entirely numeric, 0 through 9, well, there are 10 possibilities for the first digit, times 10 for the second, times 10 for the third, times 10 for the fourth, which of course gives you 10,000 possibilities. So here we have already a sort of measure of just how secure the system is, in the sense that an adversary, in order to brute force their way into a phone protected by a 4 digit passcode, is going to have to try as many as 10,000 possibilities. Now on average they might need to try half as many, so maybe 5,000 possibilities. They could get really lucky and your passcode is 0000, which funnily enough is not uncommon as the default passcode on certain systems, but in general they might need to try as many as 10,000 possibilities. Now for a human adversary that might be pretty darn tedious, typing in 0000, ah, it didn't work. 0001, didn't work. 0002, ah, didn't work. All the way up to 9999, that could indeed take quite a while.
But if we have software at our disposal, even a programming language like Python, we've seen it's not all that hard to write a loop, like a for loop or a while loop, that just tries all possibilities, and heck, if I could somehow connect my laptop or any device to a phone that I have stolen and maybe automatically send all possible passcodes through that wire to the phone, well, I can probably do things quite quickly. But how quickly? Well, let's see if we can't simulate this using our old friend VS Code. In fact, if I switch over here to my programming environment, as always, let me go ahead and hide my activity bar as unneeded. I'll go ahead and hide the Explorer as unneeded so we can focus entirely on my terminal and on my code tabs. Let's go ahead and create a new file called crack.py, where crack itself is a term of art that means to try to figure out a password, often by brute force. So I can write this program in any number of ways, but I'm going to try to keep it short and sweet and really demonstrative of how relatively easy it is to write password cracking code like this. At the top of my file, I'm going to say from string import digits. We haven't seen something like this before, but this is really just giving me access to the decimal digits 0 through 9. I could literally type them all out on my keyboard, but this is using the string library, which has a string inside of it containing all of those 10 digits, and this is just a nice way of getting all of them, which is going to be convenient in a moment for reasons we'll soon see. Then I can import some other symbol here from itertools, for iteration related tools.
Segment 3 (10:00 - 15:00)
I'm going to import a function called product, which is essentially going to allow me to very easily take the cross product of those 10 digits with each other again and again. And then quite simply, I'm going to have a loop. So I'm going to say for each passcode in the cross product of all of those digits, repeated a total of 4 times. And why am I using this syntax? Well, if you read the documentation for the product function in the itertools library, you'll see that you can pass in a list of values to take the cross product of, the 10 digits as I described earlier, and how many times you want to do that. Well, I want to repeat this process 4 times to effectively go from 0 through 9 times itself times itself times itself to get me those 10,000 possibilities in a nice for loop. Then just for the sake of discussion, let's just go ahead and print out each of those passcodes, but in practice I would probably take the phone that I've stolen, grab a USB cable or the like, plug it into the phone and laptop, hit enter, and automate this entire process. But for our purposes, we'll just send all of the output to the screen. All right, I'm going to go ahead and hit enter now after running python of crack.py, and I'm sure this will take a while, so I'll take my time walking over to the screen so we can see just how long it takes to try 10,000 possibilities. Well, that was quite fast, and in fact you don't see all 10,000 on the screen because they flew past, but I'm already down at 9999, and indeed if I scroll all the way back up, I'll see that we started at 0000. So that was, like, what, a second to go through 10,000 possibilities? It doesn't seem as though that 4 digit passcode is keeping your system all that secure. Well, what if we try to do better? Most phones allow you to upgrade from 4 digit passcodes to maybe 4 letter passcodes, so you can do something alphabetical using the English alphabet or some other. But let's suppose for the sake of discussion it's 4 letters this time.
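As a rough reconstruction of the four-digit program described above, assuming nothing beyond the two imports the lecture names, the loop might look like this. The `passcodes` helper and the joining of each tuple into a string are illustrative choices, not necessarily what appeared on screen.

```python
from itertools import product
from string import digits


def passcodes(length=4):
    """Yield every digits-only passcode of the given length, in order."""
    for combo in product(digits, repeat=length):
        # product yields tuples such as ('0', '0', '0', '1');
        # join each one into a string like "0001"
        yield "".join(combo)


if __name__ == "__main__":
    # Print all 10,000 candidates, from 0000 through 9999
    for passcode in passcodes():
        print(passcode)
```

Wrapping the loop in a generator keeps the enumeration separate from whatever is done with each guess, so printing could later be swapped for some other action.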
How many possible combinations are there of 4 English letters? Well, I would propose that there are 52 times 52 times 52 times 52. Why? Well, there are 26 letters in the English alphabet, A through Z, but if we allow ourselves uppercase and lowercase, we can double that. So that's going to give me 52 to the fourth power, which is how many? Well, it's definitely more than 10,000. It's going to be 7,311,616. Now that's quite a bit more, orders of magnitude more. Can we modify my code, though, to try hacking into a phone that's using one of these 7 million passcodes? Well, let's see. Let me go back into VS Code and make the slightest of changes. Instead of importing digits, I'm going to go ahead and import all of the so-called ASCII letters, the A through Z that we care about, and that's an easy change up here: ascii_letters. And I know this exists by having read the documentation of the string library, which gives me this as a list of possibilities, and then I'm just going to change digits here to ascii_letters so that we iterate over A through Z, both uppercase and lowercase, a total of 4 times. I'm going to go ahead into my terminal window, clear the screen. I'll make it even bigger so we can hopefully see more at once, and I'm again going to run python of crack.py and stroll on over to the screen here. Now thankfully this time it's not done yet. In fact, it looks like it's going to take a decent amount of time, but as these passcodes scroll across the screen, still going rather quickly, one column more so than the next than the next, we see we're in the lowercase f's now, we're now in the lowercase g's, and so forth. So this will continue for some time, not only through the lowercase letters but also the uppercase letters. But if I had to guess, I don't know, we're looking at a couple of minutes here maybe. And in fact, because my VS Code is in the cloud, it's actually a little slower because all of these characters have to be sent over the internet to my laptop here.
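To check the 52-to-the-fourth arithmetic and preview the letter-based variant without scrolling through all seven million lines, a small sketch along these lines may help. The `letter_passcodes` helper and the use of `islice` to peek at only the first few candidates are assumptions made for illustration.

```python
from itertools import islice, product
from string import ascii_letters


def letter_passcodes(length=4):
    """Yield every passcode of the given length drawn from a-z and A-Z."""
    for combo in product(ascii_letters, repeat=length):
        yield "".join(combo)


# 52 letters (lowercase then uppercase), each position chosen independently
total = len(ascii_letters) ** 4

# Look at just the first three candidates rather than printing millions
first_three = list(islice(letter_passcodes(), 3))

print(total)
print(first_three)
```

Because `ascii_letters` lists the lowercase letters before the uppercase ones, the enumeration starts in the lowercase range and only later reaches the capitals.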
I could run this even faster if I ran the Python program on my own Mac or PC, but I don't think this is going to take terribly long. It's not quite as dramatic as you might see in the movies, where you gradually see each of the numbers being filled in, which isn't really a thing, but this will exhaustively, eventually, get to ZZZZ. So it's better, it's more secure. Why? Because I've begun to raise the bar to the adversary. And indeed this is what we really mean when we talk about the security of some system. It's not in absolute terms, but in relative terms. And the reality is that a system protected by a 4 letter passcode, which means as many as 7 million plus possibilities, is arguably more secure than a system protected by a 4 digit passcode, which had only 10,000 possibilities. Why? Well, the attack is fundamentally still the same: you can just brute force your way in, and eventually, given enough time, you will get in. But we have indeed raised the bar to the adversary in the sense that this is now going to take them, minimally, more time. With more time might come more risk, because in the movie version of this they would only have so many seconds or minutes before someone comes out and discovers that they're trying to hack into the phone. So of course you want the code to run as quickly
Segment 4 (15:00 - 20:00)
as possible, and if you've got to go through more combinations, it's going to take you quite a bit longer. It might also, equivalently, take the adversary more resources, more money, for instance. If they want to speed up that process, now they need to invest in an even faster computer, a faster laptop than mine, in order to churn through more possibilities. So there's a tradeoff here, though, because what's the downside now? Arguably it might be a little easier for you to remember a 4 digit passcode than a 4 letter passcode, if only because the letters just allow for so many more combinations. But even this, I dare say, 4 letters, is not really going to keep the adversary out, because if we flip back to VS Code, we'll see that we're approaching the Z's, and again, if we were to run this on another system, it might go even faster, but there are not all that many. There are only those 7 million, and I think, if only for the sake of closure here, let's hang in there until we actually hit the end of this list, ZZZZ, and we're now there. So it took a couple of minutes, it would seem. All right, well, can we do even better than this? Can we make that process take even longer if we really want to decrease the probability that the adversary is going to get in? Well, what if we use 4 character passcodes, so not just letters and not even just numbers, but 4 characters more generally, where a character might be a letter of the English alphabet in this case, a digit like 0 through 9, or heck, even some punctuation. Well, if we do that, we have as many as 94 possibilities for each of these characters, because we've got, what, 52 letters, and we've got 10 digits, and, it turns out, some 32 punctuation symbols that you might typically see on an English keyboard. So we have 94 times 94 times 94 times 94, which is definitely bigger than our last two values. This time we're going to have 78,074,896, which is another order of magnitude bigger. But here's the rub.
It's not all that much harder for the adversary to wage an attack on this here system. Indeed, if I go back to VS Code and shrink my terminal window and then clear it, going back into crack.py, it's not all that hard to add in not just ascii_letters and digits, but some punctuation too. So let me import that as well: ascii_letters plus digits plus punctuation, and then I can combine all three of these lists into one bigger one so as to try 94 possibilities 4 times. Of course I could go even bigger than that. Maybe we should graduate from 4 character to 8 character passcodes. This is going to be similar math, but this time it's 94 to the 8th power, effectively. This is getting pretty darn big. This now is going to be, what, we're in the millions, billions, trillions, quadrillions: 6 quadrillion possibilities, which is a pretty darn big number, and we don't have enough time during the day to actually run this code. Why? Well, it's actually going to take quite a bit of time to wage this attack. For instance, suppose, just for the sake of discussion, that each of these operations takes 1 second. So checking one passcode takes 1 second. Well, we can do a bit of math, and if you've got to work your way through 6 quadrillion possibilities, I dare say at 1 passcode per second you will spend 193 million years brute forcing your way into this system. So based on that back of the envelope calculation alone, it would seem that eight character passwords are already pretty darn good security, unless of course your passcode is 00000000 or something else similarly guessable, or something else that's on one of those top 10 lists, or something else that's in a dictionary that can be easily checked. But that's quite a few more possibilities and probably makes your system more secure.
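The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch: the helper names are made up for illustration, and the year length ignores leap years, which does not change the rounded result.

```python
from string import ascii_letters, digits, punctuation

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignore leap years


def search_space(alphabet, length):
    """Number of distinct passcodes of this length over this alphabet."""
    return len(alphabet) ** length


def years_to_exhaust(alphabet, length, guesses_per_second=1):
    """Worst-case years needed to try every passcode at the given rate."""
    return search_space(alphabet, length) / guesses_per_second / SECONDS_PER_YEAR


# 52 letters + 10 digits + 32 punctuation symbols = 94 characters
characters = ascii_letters + digits + punctuation

print(len(characters))
print(search_space(characters, 4))
print(search_space(characters, 8))
print(round(years_to_exhaust(characters, 8) / 1e6), "million years")
```

Raising `guesses_per_second` shows how quickly the estimate shrinks for an adversary with faster hardware, which is why the one-guess-per-second figure is best read as an illustration rather than a guarantee.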
So why is it that so many websites and applications are somewhat annoyingly making you and me come up with not only long but seemingly random passwords? Because you kind of want to be there in the sweet spot of those quadrillions of possibilities, so that the odds of the adversary reaching you are themselves quite small. Well, how can we defend against this attack nonetheless? Because the tradeoff here, of course, when you have longer, more complicated passwords, is that it's going to get harder for you and me to remember these things, and then you and I are going to resort to reusing the same password, or using the same password but changing it ever so slightly for different systems. Any time you and I resort to a behavior like that, we're making ourselves more vulnerable, if only because, think about it, by transitivity, if you've got a password that looks like this on one system and the same password elsewhere, as soon as one system gets attacked, a smart adversary is going to try stuffing that same password into the other system to see if you are using the same there, or maybe modifying it ever so slightly. So if the attack is so simple, it's like 4 lines of Python code, how can we defend against this kind of attack?
Segment 5 (20:00 - 25:00)
Well, we can indeed make our passcode requirements longer and more complicated, but that just has a tradeoff with usability. It just shifts the onus onto you and me to now remember these darn things. So what if we at least have some other defenses in place, like rate limiting? For instance, what if we indeed only let the adversary try one password per second? Well, that's going to take them quite a few million years to get through the list. Odds are you and I will be dead before it's actually a problem. So that's one way to view it. Alternatively, it's probably reasonable to think that if someone is trying a million-plus passwords, odds are they are not you. This is in fact an adversary trying to get into your phone or device. After all, even if you've forgotten your password or if you're misremembering it, you're not going to sit there trying something millions of times, probably not thousands, probably not even hundreds of times. Maybe 10 times, maybe 12 times, and beyond that point it's probably less and less likely that you are you and more likely that it's an adversary trying to get in by trying to batter the door down. So with rate limiting, what you can do on an iPhone, an Android phone, or the like is perhaps have a built-in defense such that if the user fails to input the correct passcode after 10 attempts, well, you know what, let's pump the brakes now. Let's lock the phone and give the user an explanatory message that says something like, please try again later, come back in 1 minute. All right, maybe a minute later you, or in this story the adversary, tries again another 10 times and still doesn't get it. Well, maybe this time we pump the brakes more and say, you know what, come back in 5 minutes, come back in 10, come back in an hour.
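The escalating lockout described above could be sketched roughly as follows. Everything here is hypothetical: the class name, the batch size of 10, and the specific delays are taken loosely from the numbers in the story, not from how any particular phone implements it. The current time is passed in explicitly so the behavior is easy to follow.

```python
class PasscodeLock:
    """Toy lockout policy: each batch of failed guesses earns a longer wait."""

    # Hypothetical delays in seconds: 1 minute, 5 minutes, 10 minutes, 1 hour.
    # The last entry acts as an upper bound on how long anyone must wait.
    DELAYS = [60, 300, 600, 3600]

    def __init__(self, correct_passcode, attempts_per_batch=10):
        self._correct = correct_passcode
        self._attempts_per_batch = attempts_per_batch
        self._failures = 0
        self._locked_until = 0.0

    def try_unlock(self, guess, now):
        """Return 'locked', 'unlocked', or 'wrong' for a guess made at time `now` (seconds)."""
        if now < self._locked_until:
            return "locked"  # still pumping the brakes; the guess isn't even checked
        if guess == self._correct:
            self._failures = 0
            return "unlocked"
        self._failures += 1
        if self._failures % self._attempts_per_batch == 0:
            # Another full batch of failures: pick the next, longer delay (capped)
            batch = self._failures // self._attempts_per_batch
            delay = self.DELAYS[min(batch, len(self.DELAYS)) - 1]
            self._locked_until = now + delay
        return "wrong"
```

With this policy, ten wrong guesses cost the adversary a minute, the next ten cost five minutes, and so on, so working through even 10,000 four-digit codes stops being a one-second job.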
Among the funnier posts I've seen online, actually, when it comes to rate limiting, is that someone supposedly got a message on their phone that said, please try again in 16 years, which is a little ridiculous, but that seems to be a case of there maybe not being an upper bound on just how many times the user is told to try again and how much delay is added each time. So with rate limiting, we don't fundamentally change the threat or how the adversary can wage this attack, but we do effectively raise the bar so much higher that the adversary is just not going to stick around to continue waging that same attack. If they can only do 10 passcodes at a time, in between which they need to wait minutes or hours or heck, years, well, they're just going to lose interest in you, probably, as a target and move on to someone else. In fact, in the real world, it's often said by locksmiths that you don't need to be the most secure house on the block. You just need to be more secure than your neighbor. You don't need to have 100% security, having all of your doors locked with the best of deadbolts and the like. Rather, you just need to have better locks than your neighbor so that the adversary turns their attention to a lower barrier to entry. And so if we're raising the barrier to entry to the adversary, not only by choosing more complicated, longer passcodes, but also by significantly increasing the cost to them in time, in resources, in risk, odds are that might very well be sufficient, probabilistically, to keep the adversary out. But if nonetheless we're left with all of these passwords of varying complexity, how do you possibly keep track of them if, like I've claimed, you shouldn't be reusing them and you shouldn't be using similar ones? In an ideal world, you would have a very long, unique, seemingly random password for every application and website you use.
Well, that's a lot to manage, and of course behaviorally you and I might resort to fairly rational but unwise behaviors like, well, I'm going to just write it on a Post-it note on my monitor, which is all too common, say, in the workplace or home, or maybe you have a little printout in your drawer of all of your passwords or something equivalent. Well, that's all fine and good, and it might help you and decrease the barrier to entry for you, but it really decreases the barrier to entry for an adversary who might walk past that monitor and see the Post-it note or who might pull out the drawer and see all of the passwords there. So better nowadays is to use what are called password managers, software that you can download, sometimes buy, and install, and also software that increasingly comes with today's operating systems and devices, to do a lot of the management of your passwords for you. And typically a password manager is a piece of software that someone's written that allows you to generate good passwords, long and seemingly random. More than that, it saves those passwords in its memory, ideally in a secure form, so that no one can just poke around your computer and find all of your passwords. And indeed these password managers themselves are typically protected by one main account password, a primary password that ideally is indeed long, complicated, seemingly random, but that you have committed to memorizing. In fact, the proposition of a password manager is: stop memorizing and remembering all of these other accounts. Just remember this one account password for your password manager. And when you use it to log into the password manager, then you have access to all of your other usernames and passwords alike.
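As one illustration of what "generate good passwords, long and seemingly random" might mean in code, here is a minimal sketch using Python's `secrets` module, which is intended for security-sensitive randomness. The function name and the default length of 20 are assumptions for the example; real password managers have their own generators and policies.

```python
import secrets
from string import ascii_letters, digits, punctuation


def generate_password(length=20):
    """Return a random password drawn from letters, digits, and punctuation."""
    alphabet = ascii_letters + digits + punctuation  # 94 possible characters
    # secrets.choice draws from a cryptographically strong random source
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    print(generate_password())
```

A 20-character password over 94 symbols has 94 to the 20th possibilities, far beyond the 8-character case worked through earlier, and since the manager remembers it, memorability stops being the constraint.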
Segment 6 (25:00 - 30:00)
So on the one hand, there is a heightened risk here. You're putting all of your eggs, proverbially, in the same basket, such that if your password manager's password is compromised, now all of your accounts are compromised. But the reality is, if in practice you and I are using Post-it notes or cheat sheets or reusing the same passwords or using bad passwords anyway, then even though there's this new risk involved, odds are it's still a net positive, because your overall practice would be more secure (and again, there's the relativity) than your prior practices when it comes to the security of your own systems and devices. So password managers on the whole are a very good thing to use, if at least it's better than your current practices. Well, what else is a good practice? Well, increasingly you're being required by companies, by websites, and the like to use something known as two-factor authentication, which requires that in order to authenticate to a system, you have not just one but indeed two factors to prove who you are. Now these factors are fundamentally different. It would not suffice just to have two passwords instead of one; rather, you need two fundamentally different types of factors. For instance, one of the factors is typically something you know, so a password that you've long used, but the other factor is typically something that you have, for instance a phone in your pocket, a little key fob on your keychain, or the like, to which you receive a unique code, usually a 6 or so digit code, that you also have to type in in order to access that website or application. The code is constantly changing, every minute or so, but somehow stays in sync with the server that expects the code. The implication of this is that, yeah, you still need your password, and that could be compromised if somehow someone figures it out or a database is hacked into and it leaks out there.
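The lecture does not say how the changing code "stays in sync" with the server. One widely used scheme, offered here only as an example, is the time-based one-time password (TOTP) algorithm standardized in RFC 6238: the device and the server share a secret, and each derives the current code from that secret and the current time. The sketch below assumes SHA-1 and a 30-second time step, the defaults in that standard; the function name and parameters are illustrative.

```python
import hashlib
import hmac
import struct
import time


def totp(secret, at_time=None, step=30, digits=6):
    """Derive a time-based one-time code from a shared secret (bytes)."""
    if at_time is None:
        at_time = time.time()
    # Both sides count how many `step`-second intervals have elapsed
    counter = int(at_time) // step
    # HMAC the counter (as an 8-byte big-endian integer) with the shared secret
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select a 4-byte window
    offset = mac[-1] & 0x0F
    number = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep only the last `digits` decimal digits, zero-padded
    return str(number % 10 ** digits).zfill(digits)
```

Because the code depends only on the secret and the clock, the phone or key fob can compute it without any network connection, and the server can compute the same value independently to compare.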
But now, in order to get into your account, the adversary needs that second factor, not just knowing your password but having something that's physically, typically, on you, which narrows the threat from literally anyone on the internet, anyone anywhere in the world, to just the people in the same room or building with you, the same people in the Starbucks or airport or wherever you might be, which might still be a genuine threat, but a far lesser threat because there are just so many fewer people surrounding you as potential adversaries. Now here too there's some added friction and, as always, a tradeoff. This sounds like a really good thing when it comes to your security, but what's the downside? Well, one, it's literally a second step in the process. Now you have to take out your phone or find the key fob and type the code in. What if you don't have signal? What if your battery has died? What if the phone is in the other room? It does genuinely add some friction that might discourage you from using it. And so here too is this constant struggle between IT administrators and actual users, between what makes good sense theoretically and what works practically, and striking that balance is all a matter of finding good corporate IT policy. Alternatively, though, your second factor could be something you are: biometrics like your fingerprints, your face, your eyes, or something else. But what about some other threats now that arise from using all of these passwords all over the place? Well, it turns out that when you visit a website or an application, if you've forgotten your password, almost always there's some kind of link or button you can click to tell the site, sorry, I forgot my password. Now in the worst possible situation, that website or application will just email you your password at whatever email address they have on file for you. Now I say that's the worst possible scenario. Why?
Because if they can just email your password, that means they know what your password is, because it came from their database. Now maybe that's not a big deal, and they should know your password because you chose that password for that system. But again, you and I as humans are not very good at picking these things in the first place, and odds are that password statistically is the same password you use elsewhere, or similar to a password you use elsewhere, or maybe even has some kind of personally identifying information that's useful to you but not really something you want that application or website to know. And so no good comes from the website storing your password in a way that they can see it too and therefore send it to you. So if you ever do receive a password reset email that contains your actual password that you just happened to forget, that is a bad sign, and you should probably steer clear of that particular service, because it suggests that they are not adhering to best practices. So what are the best practices when it comes to storing these passwords server side, for instance in the database that's maintaining the account for which you registered? Well, we revisit the notion of hashing, and remember that hashing was an operation that we explored in the context of hash tables and the implementation of things like dictionaries in the context of data structures more broadly. When we hashed a value, we for instance looked at it, maybe a string, looked at the first letter of it, and if it started with A, maybe we went to the first location in an array or our hash table.
Segment 7 (30:00 - 35:00)
If it started with Z, maybe we went to the last. Well, you can come up with any number of hash functions, but the key design is that you take this seemingly infinite domain of inputs, strings typically, and then map them to a finite range of numeric values indicating exactly where you should put them, or string values that themselves are somehow unique. And indeed this is the context in which we'll see hashing when it comes to storing passwords. For many years, from the early days of computing, it was very common that somewhere on a server there was just a simple text file that contained a whole bunch of usernames and passwords. Now, ideally those passwords were not in the clear, in cleartext, so to speak. Ideally they would be hashed so that what you see in the file is not that Alice's password is apple and Bob's password is banana. You would instead see something seemingly nonsensical that doesn't quite read as the password itself. So in general, this is the process of hashing a password. You've got the password itself, which might very well be apple or banana. You want to get out a hash value, which in this case is not going to be as simplistic as a number from 0 to 25 or 1 to 26. It's going to be a string of text instead, according to some algorithm, and the hash function is the algorithm inside this proverbial black box that's going to turn this input into this output. So for instance, you might type in apple in the case of our discussion of hash tables, and you might get back 1. But in the case of passwords, you don't want to just spit out a very easy to guess number. We want it indeed to be a string value, so apple might instead hash to something that looks completely nonsensical like this, but that is the result of applying a well-defined hash function, a bit of code that converts this input into this output.
Meanwhile, in the past when we've hashed Banana, we were looking only at the first letter. If we use a more sophisticated hash function now designed for passwords, Banana might look a little something like this. Completely different from the hash value of Apple but seemingly nonsensical too. It's hard to imagine seeing that if you're an adversary who's hacked into some system and found this text file and imagining that it maps back to Banana. And indeed that's one of the goals of hashing in the world of passwords: the hash function should be one way, so to speak. That is to say, you could run the same input through the hash function again and again, and it's always going to give you the same output, deterministically, but it should be a one-way process such that if the adversary or anyone has access to the hash value, they should not be able to send it back into that algorithm and spit out the original password. The mathematics involved should ensure that just isn't possible. Now in practice, as cryptic as those hash values are, they're even much shorter than you would see on an actual system nowadays. So you certainly wouldn't see that Alice's password is Apple and that Bob's password is Banana. Instead, you might indeed see some hash values like this, but this is really from yesteryear. Nowadays we might use even longer hash values that are statistically even harder to extract any information from or reverse engineer. So in practice, Alice and Bob might actually have hash values that look quite a bit more like this, and I challenge you to figure out that one came from Apple and the other Banana, given the complexity alone of these hash values here. Now unfortunately, hash functions alone and storing password hashes in systems isn't necessarily a solution to all of our problems, even if the hash functions are one way and there's no way to go from the hash value back to Apple or back to Banana.
Well, you could certainly in advance calculate the hash values of, like, all English words, for instance. Go work your way through a dictionary using a for loop in Python and figure out in advance the hash value for apple, for banana, for cherry, for any number of other words in that dictionary, and then just keep them around, store them in a spreadsheet or database, such that you've precomputed all of those hash values. Thereafter, when you, the hacker, encounter some system on which you've accessed the password file that contains not the actual passwords but the hash values, well, you could use what you've created, otherwise known as a rainbow table, whereby you can look up the hash value you're seeing on the system, see if you've ever computed it before, because if you have, you just look in your little cheat sheet there and figure out, well, what password or what word from the dictionary did I hash to get that value, and you can reverse the process sort of out of band. In other words, given the hash value alone, you just cannot reverse the process. But if you take the other approach, hash all possible dictionary words and maybe combinations thereof, you can come up with a cheat sheet, maybe a long cheat sheet, but if you've got enough memory in your system or database, that's OK. You can maintain that so-called rainbow table and use that to wage an indirect attack, if you will, on these hash values. So that alone might not be sufficient defense unless
Segment 8 (35:00 - 40:00)
the hash values you're creating are so long and the passwords your users are using are so complicated that it's just not going to be worth the adversary's time or expense or memory to precompute a massive number of millions, billions, even quadrillions of possible passwords. There's just not enough time in the day to do that or enough space in their own system. So that rainbow table, quite simply akin to a spreadsheet, might just have the original password or word or words that they used as input and all of the corresponding hash values thereof. It's that relatively simple of an attack. So the only thing really protecting us at the end of the day is the size of these passwords and the size of the space in which these passwords live based on, for instance, the complexity that your system requires. Here though is another concern that we'll see if we have some more users in the mix. So Alice's password is still Apple, Bob's password is still Banana. Carol's password, I claim, is Cherry, but so is Charlie's password as well. Now it's not going to be obvious from looking at the file, assuming it's the hash values therein, and for the sake of clarity, we'll use the smaller hashes for now, but you can imagine these being much longer. But the key detail here is that, well, Alice's hash value is seemingly nonsensical. Bob's is as well. Carol's too at first glance, but wait a minute. When you look through the list, you can infer that, well, I don't know what Carol and Charlie's password is, but their hash values are identical, and this might leak information to me, the adversary. In other words, maybe Carol and Charlie have something in common, maybe they're brother and sister and had a childhood pet, maybe they both
use the same word that they have in common as their password, so I'm going to focus my attack there, or on some other relationship that might exist between these two users. You're leaking information seemingly unnecessarily because you're telling the adversary you might as well start with some heuristics as opposed to resorting from the get-go to just brute force, which might take, as we've seen, quite a bit of time. So how can we mitigate this concern whereby the same password will naturally yield the same hash value but that in and of itself would seem to leak information? Moreover, by transitivity, if somehow or other Carol's password is discovered, well, any smart adversary is going to realize, now I know what Charlie's password is as well, with no additional effort other than comparing these hash values originally. So here's how we might address this concern. If the password in question is cherry, and we're using that as input, and the hash value by default is this hash value here, what if we instead sprinkle a little bit of salt in there, so to speak, perturb the algorithm, so to speak, in a way that we're adding a little bit of noise to it? In other words, instead of passing in as input the password alone, cherry, we would also pass in some other value, usually a couple of additional characters that might be as simple as AA or AB or AC or ZZ or even a longer salt instead, but we choose that in advance, sort of sprinkle it into the algorithm so that the hash value in the end is perturbed such that if we pass in cherry as input and say a salt value of 50 arbitrarily, well, maybe the hash value ends up being this instead. Now notice, and this is not leaking information per se, 50 is the salt, and that's deliberately captured in the hash value because that's going to be useful when it comes time to check the user's password, and we'll see in a moment what that means.
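To make the salting idea concrete, here is a minimal sketch in Python. It assumes SHA-256, a salt simply prepended to the password, and a stored format of `salt$digest`; all three are illustrative choices rather than the scheme shown in the lecture, and the function names are hypothetical. Production systems generally rely on a dedicated password hashing function and a long random salt per user.

```python
import hashlib
import hmac


def hash_with_salt(password: str, salt: str) -> str:
    """Hash the salt together with the password and keep the salt in the output."""
    digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    return f"{salt}${digest}"


def check_password(attempt: str, stored: str) -> bool:
    """Reuse the stored salt to hash a login attempt, then compare hash values."""
    salt, _ = stored.split("$", 1)
    return hmac.compare_digest(hash_with_salt(attempt, salt), stored)


# Carol and Charlie both choose "cherry" but are assigned different salts.
carol = hash_with_salt("cherry", "50")
charlie = hash_with_salt("cherry", "49")

print(carol)
print(charlie)
print(carol == charlie)                   # same password, different stored values
print(check_password("cherry", carol))    # the right password still verifies
print(check_password("banana", carol))    # a wrong password does not
```

Because the salt is stored in the clear next to the digest, it is not a secret; its job is only to make equal passwords produce unequal stored values and to make one precomputed table useless across users.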
But if instead, for not just Carol but Charlie, we use a different salt value, for instance, passing in not 50 but 49, notice that the resulting hash value will be completely different, not just the 50 changing to 49, but the rest of the hash value just changed as well. So long as we have sufficiently many salts available to us, enough range of values, we can now allow users to have the same password, even unbeknownst to us, and so long as we use different salts for each user, the resulting hash value should look entirely different. But what's the implication now for checking passwords? And indeed, how do we come full circle to when I was trying to log in in the first place, at which point I forgot my password? Well, typically in a well-designed system, the way you are logged in is as follows. If you type in your username and your password, those values are indeed sent from your computer or device to the server on their back end. They've got some code there that receives that username and password, and they don't just look up in their database or in their text file that username and password. They first run your password as you typed it into the website through their well-defined hash function, and then they compare your username and that hash value to what they have in their
Segment 9 (40:00 - 45:00)
database or in that text file, and if both match, they allow you to proceed. If, though, they are using salt values in this way, what they can do when checking passwords is add those salt values to your password, run it through the hash function, and then compare those resulting hash values, including the outputted salt, against the list of hash values in their database or text file. In other words, you use and reuse the salt values again and again to make sure that the text you're comparing will indeed turn out to be the same. It's not the raw passwords or clear passwords that are compared. It's ultimately these hash values themselves, unless you discover one of these websites or applications that is in fact able to email you your password, in which case they're not doing any of this most likely, no hashing at all. They're probably just storing your password in the clear in the system. Now, whereas these hash functions are one way by design, related in spirit, but even more powerful, is the art and science of cryptography, which allows you to scramble information in such a way that you can also reverse that process. You can not only encrypt information from cleartext into what we'll call ciphertext, you can also decrypt that information from ciphertext back into plaintext, and this is useful for secure communications between two parties. So what do I mean by this? Well, suppose that you want to send a message to someone else, but you want to be able to use an insecure medium. You want to send this out over the postal service, paper-based mail, or over the internet, and someone might be eavesdropping wirelessly or on wires. How nonetheless can you ensure that the recipient can receive the message and read it, but no one else in between? Well, what you can do is this. You can take your original message, AKA plaintext, and you can choose in advance a secret key, so to speak, which is typically a number of some size.
You can then pass that into a function known as a cipher. This here then is our algorithm, the output of which is going to be so-called ciphertext, which is text that is seemingly random but with that same secret key can be reversed back into the plaintext. Indeed, to reverse that process of encryption, you would pass the ciphertext and that same key into the cipher and get back out the plaintext. So the first process is what you would do to send the message. The second process is what the recipient would do upon receiving the message, that is to say, decrypt it. This is generally known as secret key cryptography, whereby the intent is to maintain a secret between two parties and keep that thing secret so that you can have communication back and forth by encrypting and decrypting your messages. This is otherwise known as symmetric cryptography as well, insofar as you're using really the same key in both directions to do that same process. So for example, suppose you were to try to encrypt the message A that you want to send someone securely, and you choose a key in advance, a secret key, that is the number 1. You could pass them both into your cipher, which would output thereafter a value that's the result of somehow using both. So for instance, if the key is 1, a cipher might be simply to rotate the letters of the alphabet by that many places. So if you input A and 1, the output will be B. If you input B and 1, the output will be C. If you input C and 1, the output will be D, and if you input Z and 1, it will wrap back around to A.
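The rotation just described can be sketched in a few lines of Python. This is a minimal illustration, assuming only ASCII letters are rotated and everything else passes through unchanged; the function name `rotate` is hypothetical.

```python
def rotate(text: str, key: int) -> str:
    """Rotate each ASCII letter by `key` places, wrapping around the alphabet."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            # Shift within the 26 letters; % 26 handles the wrap from Z to A
            # and also makes negative keys work for decryption.
            result.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            result.append(ch)  # leave spaces, digits, punctuation untouched
    return "".join(result)


print(rotate("A", 1))       # B
print(rotate("Z", 1))       # A
ciphertext = rotate("HELLO", 1)
print(ciphertext)           # IFMMP
print(rotate(ciphertext, -1))  # HELLO
```

Decrypting is the same function called with the negated key, which is what makes the scheme symmetric: both parties only need to agree on one number.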
So this is a rotational cipher, otherwise known as a Caesar cipher, and it's relatively simplistic, especially if we're using such small numbers, but you could imagine using perhaps fancier mathematics and not just rotating the letters but changing them somehow, maybe leveraging their bitwise representation, those zeros and ones, in order to get back a result, AKA ciphertext, that you could eventually pass back through this same algorithm with the same key, or effectively negative one, and get back the original text as well. So the upside of this is the relative simplicity. So long as you and the receiver both know and have that secret key, you can do this all day long back and forth, and only if someone in the middle intercepts this message and knows your secret key or figures out somehow what it is can they too see your messages. Now of course in this scheme you probably don't want to use a key as simple as 1, and heck, maybe not even 2 or 3 or 13 or something else. You probably want to use a fancier cipher altogether, but the idea is going to be the same with symmetric key cryptography: you're still using an input and a key, and the recipient is using in some form that same key. All right, but what's the problem with that system, because it sounds kind of straightforward and great? Well, how does the recipient know what key to use? Well, you and they just have to agree in advance what the key is. Well, wait a minute, how are you going to agree in advance what the key is if you need to discuss that securely and maybe the other party is halfway
Segment 10 (45:00 - 50:00)
across the country and so you can't really call them about it. You can't send them an email or text message. Why? Well, if those aren't going to be encrypted, you're just telling the whole world what that key is, but if you are proximal to them in person, you can just tell them the secret key. Well, heck, you could tell them the whole message at that point. You don't need to establish the secure channel if they're right there. So there's sort of this chicken and the egg problem whereby to use symmetric cryptography you need to establish a key in advance but securely, but you can only do so if you already have a secure channel, for instance, one on one in person, which is not realistic. For instance, in the real world, if I want to buy something from, like, Amazon.com, I don't necessarily know anyone personally at Amazon.com, so I certainly can't come up with in advance a secret key that we can use to keep my credit card information and my password secure. So how do I buy something from Amazon.com? Well, it seems that symmetric cryptography is not going to be the solution for us, but asymmetric cryptography does offer a solution. And as the name implies, the process of encrypting and decrypting in this world is going to be a little bit different. Now asymmetric cryptography, otherwise known as public key cryptography, looks a little bit the same but uses different keys. In fact, if you want to encrypt a message using public key cryptography, particularly to exchange a message with someone like Amazon.com, with whom you have not established in advance a prior secret with which you could use symmetric key cryptography, what you can do is this: you can generate in advance, using a well-documented mathematical algorithm, a so-called public key and private key, a pair of two values, essentially really big numbers that have an inherent mathematical relationship with each other based on the algorithm you're using. For instance, one such algorithm is famously known as RSA.
And using your public key and private key, and in turn the recipient's public key and private key, you can both begin to communicate securely. In particular, if you want to send a message to someone else, for instance, Amazon.com, you can download their public key, which by definition is public. It can be posted on their website or in the footer of every employee's emails. The public key indicates that indeed this is meant to be used by anyone out there. And you can use that public key, passing in also your plaintext to whatever algorithm you're using, RSA for instance, and output a ciphertext. But the interesting thing about public key cryptography here is that now the recipient of that message, Amazon.com in this story, theoretically should be the only one in the world who can decrypt that message. Why? Because Amazon hopefully is the only one in the world that has the corresponding private key, the other number that has some inherent mathematical relationship with the public key, such that one encrypts and the other decrypts. One reverses the effects of the other. So if you then as Amazon were to pass in your private key and that ciphertext you received, you should be able to get back the plaintext message from the original sender, and the same process can work in the other direction. If Amazon then presumes to download your public key, they too can send a message in this way. But in practice, the way this typically works, at least in the context of browsers and something like HTTPS nowadays, where the S means secure, is that your browser will indeed establish a secure connection somehow with a website like Amazon.com using public key cryptography, but they'll use this public key cryptography, which in practice, it turns out, tends to be a bit slower mathematically and therefore more time consuming and more costly.
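Here is a toy sketch of that public and private key relationship, using textbook RSA with deliberately tiny primes so the numbers stay readable. Everything about it is an assumption for illustration: the primes 61 and 53, the exponent 17, the function names, and the absence of padding. Real keys are thousands of bits long and real systems use a vetted cryptography library, never code like this.

```python
# Toy RSA key generation (insecure: the primes are far too small).
p, q = 61, 53
n = p * q                      # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse (Python 3.8+)

public_key = (e, n)            # safe to publish
private_key = (d, n)           # kept secret by the recipient


def encrypt(message: int, key: tuple) -> int:
    """Encrypt a number smaller than n with the recipient's public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext: int, key: tuple) -> int:
    """Decrypt with the matching private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


plaintext = 65                 # a message encoded as a small number
ciphertext = encrypt(plaintext, public_key)
recovered = decrypt(ciphertext, private_key)
print(ciphertext, recovered)
```

Anyone can run `encrypt` because the public key is public, but recovering the message requires `d`, which is only practical to compute if you know the two primes behind `n`.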
They'll use this to send not the actual message you care about, like your credit card information or your password, but to exchange and somehow come up with an actual shared secret, some other third big seemingly random number that they can subsequently use with symmetric cryptography, which is much more efficient and therefore less costly computationally in practice. So you can use these both together, but it's the public key cryptography in particular that helps address that chicken and the egg problem. And indeed this is the S in HTTPS. It's the combination of algorithms like these that actually enable those secure communications. Now we can actually use public key cryptography to address an earlier problem involving passwords. With passwords, of course, we had this burden of generating them, managing them, and then using them just to log in. But we have now an alternative in the form of what are increasingly called passkeys, which some websites, some services now support. In the world of passkeys, you, the human, don't have to bother generating a password and then storing it on the server and then using it thereafter, which, again, is really putting the burden on you and me for our form of authentication. With passkeys, we essentially let software and our own devices do it as follows. When you seek to register for a website or service for the very first time
Segment 11 (50:00 - 55:00)
your device or your browser will generate a public and private key pair that, as before, have a mathematical relationship between each other such that one effectively reverses the process of the other. But we're going to use this public and private key pair in a different way. We're going to send the public key to that server for which we want to sign up for an account, and it's going to keep that around for me. It's then going to send me some kind of challenge, like a big random number or string value that I'm going to use as input essentially to an encryption process as follows. I'm going to take that challenge and my private key and essentially encrypt the challenge with my private key, technically referred to as digitally signing the challenge. I'll then send the result, my so-called digital signature, up to the server. The server, upon receipt of my signature, can use my public key, apply it to the signature, essentially use it to decrypt that signature, which should spit out ideally the original challenge. And again, thanks to the mathematics underlying public key cryptography in this way, the public key and private key have an inherent mathematical relationship that ensures probabilistically that those are the only two values that can perform these operations for me. And so subsequently, the next time that I visit that same website, my device, maybe with my blessing or approval, can simply await a challenge from the server, use my private key as before, digitally sign that challenge, send it up to the server, and the server again using my same public key can confirm, yep, mathematically this is the same user as registered for this malan@harvard.edu account, even though he is now passwordless, using passkeys instead of a traditional password.
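The challenge and signature exchange above can be sketched with the same kind of toy RSA arithmetic. This is a simplified model under stated assumptions, not the actual passkey (WebAuthn) protocol: the tiny primes, the use of a SHA-256 digest reduced modulo `n`, and the function names `sign` and `verify` are all hypothetical choices made so the example runs with only the standard library.

```python
import hashlib
import secrets

# Toy key pair generated on the user's device (insecure: tiny primes).
P, Q = 61, 53
N = P * Q
E = 17
D = pow(E, -1, (P - 1) * (Q - 1))   # modular inverse (Python 3.8+)

public_key = (E, N)    # sent to the server at registration
private_key = (D, N)   # never leaves the device


def digest_as_int(challenge: bytes, modulus: int) -> int:
    """Reduce a SHA-256 digest of the challenge to a number below the modulus."""
    return int.from_bytes(hashlib.sha256(challenge).digest(), "big") % modulus


def sign(challenge: bytes, key: tuple) -> int:
    """Device side: transform the challenge digest with the private key."""
    d, n = key
    return pow(digest_as_int(challenge, n), d, n)


def verify(challenge: bytes, signature: int, key: tuple) -> bool:
    """Server side: undo the signature with the public key and compare."""
    e, n = key
    return pow(signature, e, n) == digest_as_int(challenge, n)


challenge = secrets.token_bytes(16)          # server picks a fresh random challenge
signature = sign(challenge, private_key)     # device signs it
print(verify(challenge, signature, public_key))
```

The server never sees a password or the private key; it only learns that whoever answered the challenge holds the private key matching the public key it stored at registration. A fresh challenge per login keeps an eavesdropper from replaying an old signature.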
And indeed if you sign up for services nowadays and see a passwordless option, odds are underneath the hood it's using a little bit of public key cryptography and some sophisticated mathematics to at least take you and me, who have really been the source of most of our problems, out of the loop, allowing the two computers to communicate on our behalf. So how else can we use encryption, this scrambling of information, to keep ourselves ever more secure? Well, it's not always a feature offered, but increasingly a lot of services offer what's called end-to-end encryption. And the reason is this: in the world of, say, the web using HTTPS, which keeps the connection secure, you might very well have a connection between you and that server by design, and other users might have encrypted connections between themselves and that server. Suppose the server is Gmail. This means that I can send an email from my device to Gmail and it can be kept secure in between. No one in between me and Gmail can actually see that message, and anyone who receives that message can log into Gmail, have an encrypted connection there, and similarly keep that message secure. The catch though is that the man in the middle, the machine in the middle, that is Gmail and all of Google's employees, theoretically could have access to that same email. Now hopefully there are technological defenses and authorization flows and just corporate policies in place that largely prevent that, but I do not have a property called end-to-end encryption because that email is not secure between me and the final recipient. It's sitting on a server with which you and the other person each have a secure connection, but that doesn't mean that the data is secure from prying eyes in that middle step. Indeed, this is what really speaks to our privacy because you can have security but not necessarily privacy. I can have a secure connection indeed to Google, as can you, but our emails might not necessarily
stay private if that machine in the middle, Gmail itself, technically has access to all of our data therein. So these are different properties that we might want to achieve, but with end-to-end encryption, we can guarantee this a bit better. In fact, various services nowadays like WhatsApp, for instance, and iMessage and Signal and other devices and services offer a stronger guarantee of privacy by way of end-to-end encryption, and what this means is that even if your data is traveling through some third party server like Apple's, the mathematics involved ensure that your device will encrypt the message in such a way that only the final recipient can decrypt it. And even though that message in encrypted form is going through Apple's servers or anyone else's, the reality is that only the recipient can access it and ultimately read it. Now the machine in the middle might be able to see the ciphertext, the encrypted form, but to them it might as well just be random zeros and ones because if they don't have access to the keys that you are both using as part of this end-to-end encryption, it's just going to look like random noise to them. And this notion of random noise comes into play in another scenario as well. Let's consider the security, not of our systems per se, but now of our actual data on those systems. It's quite common, of course, to occasionally want to delete some files, maybe some document you wrote or some file you downloaded. It stands to reason you occasionally might want to
Segment 12 (55:00 - 60:00)
delete it either because you don't want it anymore or just want to free up the space. Unfortunately, historically, deleting a file itself has not really been a secure operation. Typically when a computer deletes a file, it effectively just forgets where it is. It loses track of what file name refers to what zeros and ones inside of the device, and that's largely for historical efficiency reasons. It's a lot faster to just forget about the bits and then reuse them later than worry about changing them so that the actual data, the patterns of zeros and ones, are no longer there. For instance, consider a pattern of zeros and ones that might represent some large file on your system. Suppose that you decide to delete a file, for instance, these zeros and ones here. Well, the computer can indeed just simply forget what those bits were previously used for, a resume, an Excel file, or anything else, and then gradually just let the computer reuse those zeros and ones and gradually overwrite them with new data. So that is to say, when you delete a file represented by these zeros and ones here, then the computer effectively has more space. Like, the zeros and ones don't go away. Like, you don't delete the actual zeros and ones because then you'd constantly be losing space, but you can mark them for reuse by some other files. The next file you download, in fact, maybe only needs a subset of those bits, maybe just these here, such that these zeros and ones here still in white are still remnants of your original data, the original file that you deleted, but perhaps enough that you're going to leak information. In fact, if a researcher or law enforcement were to gain access to this device, they could reconstruct here part of the file and maybe see some of the numbers that were actually in that Excel spreadsheet of yours or something else. So really deleting a file doesn't mean that the data is gone forever.
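A small simulation can make that forgetting visible. The sketch below models a disk as a flat array of bytes plus a directory that maps file names to locations; the `ToyDisk` class, its methods, and the file contents are all hypothetical, and real file systems are far more involved.

```python
class ToyDisk:
    """A toy disk: raw bytes plus a directory of where each file lives."""

    def __init__(self, size: int) -> None:
        self.blocks = bytearray(size)   # the actual zeros and ones
        self.directory = {}             # file name -> (start, length)
        self.next_free = 0

    def write(self, name: str, data: bytes) -> None:
        start = self.next_free
        if start + len(data) > len(self.blocks):
            raise ValueError("disk full")
        self.blocks[start:start + len(data)] = data
        self.directory[name] = (start, len(data))
        self.next_free = start + len(data)

    def delete(self, name: str) -> None:
        """Forget where the file is; the bytes themselves are left untouched."""
        del self.directory[name]


disk = ToyDisk(64)
disk.write("budget.xlsx", b"salary=90000")
disk.delete("budget.xlsx")

print("budget.xlsx" in disk.directory)   # the file is no longer listed
print(b"salary=90000" in disk.blocks)    # but its contents are still on the disk
```

Anyone who reads the raw bytes rather than asking the directory, which is roughly what forensic recovery tools do, can still find the old contents until something else happens to overwrite them.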
It's really just forgotten but often recoverable unless you really wipe or sanitize the device that's storing the data, and secure deletion therefore would mean actually doing a bit more work to ensure, at least with high probability, that those zeros and ones cannot in fact be recovered in the same way. So if here again is your original hard drive inside of your computer with all of its zeros and ones and you decide to delete some big file here, well, you should probably take an additional moment to change all of those bits to zero because now you've effectively securely deleted the file because I have no idea henceforth which of those zeros and ones were actually ones because everything now looks like a zero. So secure deletion might do this, or you could better yet just completely randomize it. So it's not all zeros, it's just completely random zeros and ones. So for all intents and purposes it is indeed now just noise. The funny thing though about this approach, which is very reasonable and which can be considered a good practice depending on the hardware that you're using and the system you're using it with, is that it isn't necessarily the most efficient, especially with today's solid state drives and other devices which technically have a fairly finite life such that the more you read from and write to them, the more wear and tear you're essentially putting on the electronics, and eventually those zeros and ones might not be good anymore, and over time the device might say, ah, you can't use those zeros and ones anymore, at which point some of your data might still be there. It just becomes inaccessible to you and you have less space to work with. So increasingly common is to turn on what's generally called full disk encryption, which is a use of encryption on your entire hard drive, like all of the files on your device. The upside of which is that all of the data on your computer, at least while you're logged out or the power is off, is completely encrypted.
That is, scrambled with some secret key that's maybe in the form of your password or protected with your password, something that only you control. So instead of the zeros and ones on your computer looking a little something like this, they will, when full disk encrypted, look completely random, and only once you log in with your appropriate password does the computer decrypt those into the patterns of zeros and ones that are actually useful, the Excel files and other files that you might have on your hard drive. The upside of this is that if your device is ever stolen, for instance, so long as it's not powered up and logged in, the adversary, if they're trying to get at your data, essentially is only going to see random zeros and ones unless they somehow gain access to or figure out your password. So you've effectively securely deleted all of your files just by not telling the adversary what your password actually is. The flip side is if you have a password that's especially hard to guess, which is good, but you yourself forget it at some point and therefore can't log into your device, you've effectively wiped your hard drive as well, securely deleted everything, because that password is the key to decrypting all of your data. But this is helpful in more innocuous forms when you might want to trade in or sell a device like a phone or a laptop
Segment 13 (60:00 - 62:00)
whereby you want to be able to trust that the other user can't gain access to files that you might have once had on the hard drive. Well, if you've been keeping it securely full disk encrypted this whole time, so long as you don't give them the password, for all intents and purposes, you're just giving them a hard drive with random zeros and ones, and you don't even necessarily have to worry about changing them all to zeros or completely random altogether. And in fact, if you take an iPhone or an Android device and you wipe it, so long as you're using a password in the first place, this is typically how it is wiped for trade-in or for sale: by just really sanitizing the secret key that was used to encrypt all of that information, as opposed to all of the information itself being very painstakingly and often slowly changed to all zeros or randomness, which is ever more of a slow process on larger devices, laptops and desktops that have even more space on them. But here too we have yet another trade-off whereby these same features can be used for evil as well, and ransomware, for instance, is an application of these basic ideas for ill purposes whereby if an adversary breaks into your system, they can actually use encryption to scramble all of your most important files and not tell you the secret key unless you pay up some ransom, and this has happened all too often in the real world where some company, some hospital, some system is infiltrated somehow. The systems themselves are left online, but all of the important data is indeed encrypted unless someone pays up using cash or Bitcoin or some equivalent. And so these
things that keep us safe can also put us at risk, and how you defend against this is by not only adhering to best practices across the board, as we've discussed already today, but also having backups of all of your most important data, offsite backups that aren't networked no less, so that when the adversaries do encrypt things maliciously, you don't back up the encrypted versions and then have nothing but the encrypted form. Today then was all about securing systems. So what can we do personally and professionally in the days ahead? Well, minimally start using a password manager and/or passkeys where available. Start using two-factor authentication, if not for all of your accounts, at least for the most important thereof, and seek out and use end-to-end encryption when you can so as to keep your data not only secure but private as well. Ultimately we're not going to be able to keep our systems absolutely secure, but through these lessons learned, through these best practices, we can at least aspire to make our systems and data relatively more secure than everyone else's. This was computer science for business.