# The Algorithm Behind Spell Checkers

## Metadata

- **Channel:** b001
- **YouTube:** https://www.youtube.com/watch?v=d-Eq6x1yssU
- **Source:** https://ekstraktznaniy.ru/video/31474

## Transcript

### Segment 1 (00:00 - 05:00)

I'm sure we're all very familiar with the squiggly red lines that appear under misspelled words when we're typing away at an essay or an email. But have you ever stopped to wonder how such a feature is implemented? Sure, it's a simple feature that we probably rarely think about, but believe it or not, it's all thanks to an incredibly brilliant algorithm working under the hood to help keep our communication clear and professional.

The genesis of spell-checking systems can be traced back to the late 1950s and early 1960s, notably with the pioneering concepts of Herbert Glantz in 1957 and the subsequent development by Charles Blair in 1960. Their idea was simple: each word typed by a user was compared against a list of correctly spelled words, and words not found in the list were flagged. Though these systems lacked the ability to actually suggest correct spellings, they were crucial first steps for spell-checking technology.

This early work set the stage for a significant advancement in the early 1970s, when Ralph Gorin developed a more sophisticated program known as SPELL. Ralph's program was a major advancement over its predecessors: while it still identified misspelled words, it also suggested a list of plausible correct spellings. Words considered plausible were those with an edit distance of one, meaning the difference could be a single-letter insertion, a single-letter deletion, or a substitution of one letter for another. Ralph published his program in 1971, making it publicly available and quickly spreading its use worldwide. At the time this was an incredibly useful program and a massive leap in spell-checking technology, but there was one glaring problem: not all spelling errors are within one edit distance. While this remarkable achievement certainly deserves recognition, there was still potential for enhancing spell checkers further. We needed a way to calculate plausible words that are beyond one edit distance. But how do we even quantify edit distance?

Well, luckily, an algorithm had already been discovered to do just this, six years prior. In 1965, Soviet scientist Vladimir Levenshtein published his famous Levenshtein distance algorithm. While it wasn't initially designed for spell checking, it would soon be the missing piece of the puzzle for furthering the development of spell checkers. Now, I know this looks like a jumbled mess of confusion at first, but we're going to step through it to show that it's actually quite simple and elegant.

To demonstrate, I'm going to use it to find the edit distance from the word "hat" to "can". Just like Ralph Gorin's program, we can pick from only one of three edit moves: deletion, insertion, or a combination of the two, substitution. Now, you can probably intuitively guess that the edit distance between these two words is just two. I mean, we just substitute h with c and substitute t with n. And you're correct, it is in fact two, so let's keep that in mind so we can check the answer of this algorithm.

So to start, let's plug our two words into the algorithm. Starting with our first conditional: if the length of "can" is zero, we return the length of "hat". Well, the length of "can" is not zero, so we move on. There's a similar condition here, just with the words swapped: the length of "hat" is not zero, so we move on. Our next conditional asks if the head of "hat" is equal to the head of "can", in other words, is the first letter of these two words the same? They're not, so we move on. Now that we've essentially gone through the guard clauses of this algorithm, we know that a move must be performed, so we add one to our edit distance. Next, let's clean up what we have here in these square brackets, and you'll see that we have three possible routes we can take: we can delete a letter, insert a letter, or substitute a letter. Since we don't know which of these is the best route, we run all three of them recursively and choose the route that returns the smallest edit distance.

Since we were able to intuitively solve this earlier, we know that our first move is substitution, so let's follow that route. Again we go through our guard clauses: the length of "an" is not zero, so we move on; the length of "at" is not zero, so we move on. Next we get to this conditional: is the head of "at" equal to the head of "an"? It is, because they both start with the letter a. So here we don't perform any move, since they both have the same letter, and hence we don't add anything to our running edit distance. Let's simplify this call to the function again: now we're inputting the tails of "at" and "an", which are just "t" and "n". Going through this once more: the length of "n" is not zero, so we move on; the length of "t" is not zero, so we move on; the head of "t" is not equal to the head of "n", so we move on. So now we know a move must be performed, so let's add another one to our edit distance. Again we've got three possible routes to take, and you'll see that
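The recursive procedure walked through in this segment can be sketched in Python. This is a minimal sketch of the textbook recursive definition (three edit moves, guard clauses first), not the exact code shown on screen:

```python
def lev(a: str, b: str) -> int:
    # Guard clauses: transforming to or from an empty string takes
    # as many insertions/deletions as the other string has letters.
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    # Matching heads cost nothing; recurse on the tails.
    if a[0] == b[0]:
        return lev(a[1:], b[1:])
    # Otherwise one move must be performed: try deletion, insertion,
    # and substitution, and keep the route with the smallest distance.
    return 1 + min(
        lev(a[1:], b),      # delete the head of a
        lev(a, b[1:]),      # insert the head of b
        lev(a[1:], b[1:]),  # substitute one head for the other
    )
```

Running `lev("hat", "can")` follows exactly the routes traced above and returns 2.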

### Segment 2 (05:00 - 10:00)

when I clean up these inputs, each of these routes has at least one empty string as an input, which in the next recursive call would get handled by these base-case conditionals. If we were to perform all three of them, we would get 1, 1, and 0, and we take the minimum of those, which is 0. So there we have it: using the Levenshtein distance algorithm, we were able to calculate the edit distance from "hat" to "can". That wasn't so bad, right?

Except it kind of is. Looking again at this algorithm, remember that we're calling the function recursively three times in every iteration. This gives us a time complexity of O(3^n). Can you imagine the amount of recursion involved when computing the edit distance between longer words? It's incredibly inefficient. But to Levenshtein's credit, he didn't initially formulate this algorithm to be used as a computer program; it was more of a mathematical concept.

Now, earlier I noted that this algorithm was developed six years prior to Ralph Gorin's program. While doing research for this video, I asked myself why Ralph decided not to use the Levenshtein distance algorithm when creating his program. I can only speculate that either he was unaware of it at the time, or, in the more likely scenario, that the Levenshtein distance algorithm requires significant processing power and memory due to its recursive nature, so it wasn't a practical algorithm to implement programmatically.

That was, until 1974, when Robert Wagner from Vanderbilt University and Michael Fischer from MIT introduced a groundbreaking solution to the inefficiency of the Levenshtein distance algorithm. Instead of relying on the intensive recursive calls that characterize Levenshtein's approach, Wagner and Fischer utilized a technique known as dynamic programming, which involves breaking a complex problem down into simpler subproblems, solving each of those subproblems just once, and storing their solutions. By doing this, the algorithm avoids the excessive repetitive work seen in Levenshtein's recursive approach.

To demonstrate this elegant solution, I'm going to find the edit distance from "boats" to "float". To break this problem down into smaller subproblems, the Wagner-Fischer algorithm suggests adding these words to a matrix, like so. This will allow us to determine the edit distance between substrings and build off those calculations as we go. Again, the edit steps we can perform are deletion, insertion, and substitution.

Okay, so first, starting with this top-left cell, we're asking ourselves: which of these operations must we perform to transform an empty string into an empty string? Well, none, so we put a 0 here. Next we ask what operation must be performed to transform an empty string into the string "f". We would use insertion here, to insert an f. That's one operation, so we put a 1 here. Moving on, what operations are needed to transform an empty string into the string "fl"? That would be two insertion operations, and you'll see that we can build off our previous calculation: if it took one insertion to go from an empty string to "f", then it only requires one more insertion operation to get the l, so we put a 2 here. Same with transforming an empty string to "flo": three insertion operations, so we put a 3 here. And you can probably see that as we keep going across this first row, it's just one additional insertion operation until we get to the end.

Okay, so next let's fill out this first column. Going from the string "b" to an empty string takes one deletion operation this time, so we put a 1 here. Next, to go from the string "bo" to an empty string, again, just like before, we can work off our previous calculations: if it took one deletion operation to get from "b" to an empty string, it only takes one additional deletion operation to go from "bo" to an empty string. And the same pattern appears again, so we can just fill out the rest of this column following suit. In fact this will always be the case, with the first row and column always being filled like this.

Okay, so now let's get on to the meat of this matrix. Here we're asking ourselves what operation must be performed to transform "b" to "f". In this case we use one substitution operation. Since we're substituting, we build off the value that is diagonally up and to the left of the cell, so we add one to zero and insert a 1 here. Okay, now we need to determine how many operations are needed to transform "b" to "fl". But instead of sitting here and thinking about which operations we should use, what we're going to do instead is look at these three cells. All we need to do is take the minimum value of these three cells and add one to it: here we take 1 and add one to get 2. So we can use this trick to just continue filling out the rest of the matrix, except you've got to be careful when you get to a scenario like this, where
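The matrix-filling steps described in this segment can be sketched as a Python function. This is a sketch of the standard Wagner-Fischer table with unit costs for all three operations, assuming the same row/column layout as the walkthrough; it is not the exact script from the video:

```python
def wagner_fischer(source: str, target: str) -> int:
    m, n = len(source), len(target)
    # dp[i][j] = edit distance between the first i letters of source
    # and the first j letters of target.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # First column: i deletions turn an i-letter prefix into the empty string.
    for i in range(m + 1):
        dp[i][0] = i
    # First row: j insertions build a j-letter prefix from the empty string.
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                # Letters match: no operation, copy the diagonal value.
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # Otherwise: minimum of the three neighboring cells, plus one.
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                    dp[i - 1][j - 1],  # substitution
                )
    return dp[m][n]
```

The bottom-right cell of `dp` is the answer: `wagner_fischer("boats", "float")` returns 3. Each subproblem is solved once and stored, so the running time is O(m·n) instead of the recursive version's O(3^n).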

### Segment 3 (10:00 - 13:00)

both letters are the same. No operation is needed when they're the same, so here we just take the minimum value and don't add anything to it: 2 is the minimum value, so we use that.

Okay, so once the table is filled out, we can look at this final value here in the bottom-right corner, and that is our edit distance to get from "boats" to "float". We can even use this matrix to determine the operations that were used to transform them. From this bottom-right cell, which neighboring cell has the smallest value? Well, it's this 2 above it, which means we delete the s from the end of "boats". The next smallest-valued cell is the 2 just diagonal from it. The value is still 2 because no operation was performed here: you'll notice that both strings have a t in this position. The same thing here: both strings have an a in this position, so the next smallest cell is diagonal and we can jump to it without performing any operation. And again, the same idea with this o. Now, at this point we could go one of two ways; here I'm going to jump left and perform an insertion of l. And finally we jump to the last cell and perform a substitution to replace the b with an f. And so, yeah, there we have it.

So you can probably see how this is a much more efficient way of calculating the edit distance between two strings, and it's simple enough to be implemented programmatically. To demonstrate, I created a Python script that implements the Wagner-Fischer approach so that we can compute the edit distance between two words. Here I can input a misspelled word, and the script computes the edit distance of that misspelled word against each word in a massive dictionary of correctly spelled words, then returns a list of the words that have the smallest edit distance. If this were used in a spell checker, these are the words that would get recommended to me.

Now, sure, this is still an inefficient method for a spell checker. I mean, you're telling me that a spell checker takes each word that's typed and computes its edit distance to every word in a dictionary? There are clearly some improvements that can be made from here. Maybe don't compute the edit distance immediately; maybe first check whether the word exists in the dictionary, to see if it's spelled correctly, before computing its edit distance to other words. Maybe the dictionary should be pre-sorted in some way. You can imagine where spell checkers went from here: popular programs like WordCheck and Microsoft Word in the '80s and '90s utilized more evolved and optimized versions of these algorithms. Fast forward to today, and as you can imagine, many modern spell checkers now incorporate techniques such as natural language processing to understand the context of the sentence in order to suggest more appropriate word corrections.

If you're interested in playing around with this Wagner-Fischer algorithm, I've uploaded the Python script to my GitHub, linked below. I encourage you to find a way to continue optimizing the solution. If you enjoyed this video, be sure to leave a like, and if you enjoy content like this, hit that subscribe button and consider checking out some of my other content. Thanks for watching, and I'll see you all in the next video.
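As a rough illustration of the brute-force lookup described above, here is my own sketch of a suggestion routine, not the script from the video's GitHub. It applies the first optimization mentioned (check dictionary membership before computing any distances), and the tiny word list in the usage note is just an example:

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic-programming edit distance."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # deletions down the first column
    for j in range(len(b) + 1):
        dp[0][j] = j                      # insertions across the first row
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match: no operation
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],        # deletion
                                   dp[i][j - 1],        # insertion
                                   dp[i - 1][j - 1])    # substitution
    return dp[len(a)][len(b)]


def suggest(word: str, dictionary: set) -> list:
    """Return the dictionary words closest to `word`,
    or an empty list if `word` is already spelled correctly."""
    if word in dictionary:                # cheap membership check first
        return []
    distances = {w: edit_distance(word, w) for w in dictionary}
    best = min(distances.values())
    return sorted(w for w, d in distances.items() if d == best)
```

For example, `suggest("boat", {"boats", "float", "goat"})` returns `["boats", "goat"]` (both at distance 1), while `suggest("float", {"boats", "float", "goat"})` returns `[]` because the word is already in the dictionary.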
