In this coding challenge, I struggle my way through implementing a Naive Bayes text classifier in JavaScript using p5.js. I explain Bayes' theorem, demonstrate word frequency analysis, implement Laplacian smoothing, and build a working sentiment classifier that runs entirely in the browser. Code: https://thecodingtrain.com/challenges/187-bayesian-text-classification
🚀 Watch this video ad-free on Nebula https://nebula.tv/videos/codingtrain-coding-challenge-187-bayes-classifier
p5.js Web Editor Sketches:
🕹️ Text Classifier - Initial Version: https://editor.p5js.org/codingtrain/sketches/RZ8a1z4DN
🕹️ Text Classifier - Refactored Version: https://editor.p5js.org/codingtrain/sketches/P3ngrAANX
🕹️ Text Classifier - File Loading Version: https://editor.p5js.org/codingtrain/sketches/WowR2Q9xg
🎥 Previous: https://youtu.be/5iSAvzU2WYY?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH
🎥 All: https://www.youtube.com/playlist?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH
References:
📓 Naive Bayes Classifier: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
📓 Laplacian Smoothing: https://en.wikipedia.org/wiki/Additive_smoothing
Videos:
🚂 https://youtu.be/unm0BLor8aE
🚂 https://youtu.be/7DG3kCDx53c?list=PLRqwX-V7Uu6YEypLuls7iidwHMdCM6o2w
📺 https://youtu.be/HZGCoVF3YvM
🚂 https://youtu.be/0Ad5Frf8NBM
Live Stream Archives:
🔴 https://youtube.com/live/TsBDm0P0qaA
Related Coding Challenges:
🚂 https://youtu.be/unm0BLor8aE
🚂 https://youtu.be/eGFJ8vugIWA
Timestamps:
0:00:00 Hello!
0:03:34 Explaining Bayes' Theorem
0:12:07 What is Naive Bayes?
0:13:49 Setting up the Classifier in p5.js
0:15:41 Coding the train() function
0:22:14 Coding the classify() Function
0:24:45 Revising the train() function
0:29:06 Implementing Probability Calculations
0:33:24 Laplacian (Additive) Smoothing
0:42:21 Ignoring the Denominator (Normalization)
0:45:36 Quick User Interface
0:49:42 Final thoughts and next steps.
Editing by Mathieu Blanchette
Animations by Jason Heglund
Music from Epidemic Sound
🚂 Website: https://thecodingtrain.com/
👾 Share Your Creation! https://thecodingtrain.com/guides/passenger-showcase-guide
🚩 Suggest Topics: https://github.com/CodingTrain/Suggestion-Box
💡 GitHub: https://github.com/CodingTrain
💬 Discord: https://thecodingtrain.com/discord
💖 Membership: http://youtube.com/thecodingtrain/join
🛒 Store: https://standard.tv/codingtrain
🖋️ Twitter: https://twitter.com/thecodingtrain
📸 Instagram: https://www.instagram.com/the.coding.train/
🎥 https://www.youtube.com/playlist?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH
🎥 https://www.youtube.com/playlist?list=PLRqwX-V7Uu6Zy51Q-x9tMWIv9cueOFTFA
🔗 p5.js: https://p5js.org
🔗 p5.js Web Editor: https://editor.p5js.org/
🔗 Processing: https://processing.org
📄 Code of Conduct: https://github.com/CodingTrain/Code-of-Conduct
This description was auto-generated. If you see a problem, please open an issue: https://github.com/CodingTrain/thecodingtrain.com/issues/new
#bayestheorem #textclassification #naivebayes #sentimentanalysis #naturallanguageprocessing #machinelearning #wordfrequency #laplaciansmoothing #javascript #p5js
Table of Contents (12 segments)
Hello!
Hi everybody, welcome to a coding challenge video. Hi, it's me from the future in the editing booth. Of course, I'm not actually doing any editing. I have no skills there. That's Mathieu, the longtime Coding Train video editor, and I'm watching this back right now. And while I'm watching it, I'm thinking, let me give you a little heads up, a little context about what you're going to watch. So, I'm trying to get back into making regular weekly videos for the Coding Train this fall. Every Monday I'm live streaming and working through a topic, and I think I made a mistake, which was picking Bayes' theorem and Bayesian text classification, which I don't know that well, as the first topic to tackle. So, there happens to be a three and a half hour live stream where I struggle through all sorts of technical issues and all sorts of confusion, and I'm trying to explain Bayes' theorem multiple times. This is an edited version of that where I'm trying to piece together the full story, and there is a finished working version at the end. Now, there are a bunch of mistakes in that working version, which is why I have this editing booth. I'm going to pop in from this view and give some corrections along the way and a little bit of commentary. I have also made an organized, fully commented, cleaned up, corrected version of the Bayesian text classifier that I'll provide as a resource, and I will show it and talk through it a little bit towards the end of the video. Anyway, I'm going to stop babbling. Stay tuned, lots more to come, and just watch me go. Here it is. Ah. This one, I don't know. I'm going to do it anyway. This is really a part four of a bunch of videos that I made many years ago, eight years ago apparently, where I looked at how to count the frequency of words in a document, did it in JavaScript, did it in Processing with some visualization, and that leads to all sorts of other kinds of text analysis applications. For example, you could generate the keywords for a document by looking at the frequency of words: term frequency, inverse document frequency (TF-IDF). So, that was one example of taking word counting to the next level. The other example is applying Bayes' theorem. So, I've got a really challenging task ahead of me, which is to attempt to explain Bayes' theorem to you. Now, the good news is you don't really need me to explain it to you because there are many excellent videos on YouTube. This would be, I think, probably one of the best ones, if not the best one. I'm going to give you my own explanation. So, maybe stick with me and see how I do, and go watch that later. But, the point of this is not for me to give you the best, most comprehensive explanation of Bayes' theorem, but to look at how to apply it in the context of word counting. The most important thing that I want to say to you, however, is you might be thinking, what's wrong with you? Don't you know we have AI now and deep learning and all sorts of giant transformer models to do classification with text? Yes, and I would actually like to look at how to do that, and I want to cover things like word embeddings and text embeddings and how you can use them for classification and sentiment analysis, all of that. But, this is a classic algorithm, and I think it's such important foundational knowledge for these newer, fancier systems that exist today. Not to mention the fact that I've got the p5.js web editor here.
I'm going to write the entire Bayesian text classification algorithm just in a little p5.js sketch in the browser. We're going to run it. We're going to load text into it. It's going to work. I don't need a GPU or a cloud server or some pre-trained model
Explaining Bayes' Theorem
none of that stuff. All right, so let me first just write out Bayes' theorem for you. Actually, no, I'm not going to write it out. Let's look at a scenario that's going to be relevant to text classification. Imagine a giant library. And in this library, 1% of the books are sci-fi. That means 99% are not science fiction. I'm going to give you some hints along the way. I'm going to borrow from 3Blue1Brown's geometric visualization with my poorly drawn diagrams. So, these are all the sci-fi books. Obviously not drawn perfectly to scale, but that's maybe 1% of this big library of books. 80% of those books, the sci-fi books, have the word galaxy in the title. In the rest of the library, the 99% of books that are not sci-fi, 5% have the word galaxy. So, this is a scenario where we can use Bayes' theorem to calculate the answer to a given probability question, which I'm going to ask right now. Given any random book that I pull down from this library that happens to have the word galaxy in its title, what is the probability that it is sci-fi? Pause the video, think about what your answer might be. What answer did you come up with in your head? Now, this is a made-up scenario. I'm not really trying to trick you, but Bayes' theorem really comes up often as an example of how people's understanding and intuition about probabilities can be way off. If you got it wrong, you might have overweighted the fact that 80% of sci-fi books have the word galaxy in the title, which is kind of absurd. I mean, in reality, 80% of sci-fi books do not have the word galaxy in them. But, if that were actually true, you might think, this book has the word galaxy in it, the probability is like really high. I don't know, 70%? But, it's not. And the reason why it's not is, let's keep going with my diagram here. So, remember, this is the little sliver, 80% of 1% of all the books: the sci-fi books that have the word galaxy in them. Now, remember, 5% of all of the non-sci-fi books also have the word galaxy in them. This 1% is way too wide, so I'm going to draw the 5% also too wide. Again, not drawn to scale at all. But, presumably, all of these books, 5% of all the non-sci-fi books, also have the word galaxy. So, if I were to pluck a random book that has the word galaxy in the title, it's got to be one of these. It's a much smaller chance that it's a blue book than a red book. What is the exact probability? Let's calculate it, and then we're going to go back and see how Bayes' formula is actually the formula for the calculation I'm about to do. Let's say there are 10,000 books in the library total. 1% are sci-fi, so 100 books are sci-fi. 80 of those books then have the word galaxy in them, because that's 80% of 100 sci-fi books. How many non-sci-fi books are there? Well, 99% of the books are not sci-fi, so that's 9,900 non-sci-fi books. 5% of those (I wish I'd made it 10%), 5% of those non-sci-fi books have the word galaxy in them. So, how many is that? 10% would be 990, so 5% is 495. 495 of those have the word galaxy in them. We're going to lose our drawing for a second, but that's okay. So, what that means is, if I pick this random book with the word galaxy in the title, what's the chance that it's one of those 80 and not one of those 495? Well, it's 80 divided by the total number of books that have the word galaxy in them, 80 plus 495, and this equals, because I did this math earlier before I got here, about 13.9%. Okay. So, we worked out the scenario a little clunkily, a little awkwardly.
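To make that arithmetic concrete, here's a minimal sketch of the same calculation in JavaScript, using the made-up numbers from the library scenario:

```javascript
// Made-up library scenario: 10,000 books, 1% sci-fi,
// 80% of sci-fi titles contain "galaxy", 5% of non-sci-fi titles do.
const totalBooks = 10000;
const sciFi = totalBooks * 0.01; // 100 sci-fi books
const nonSciFi = totalBooks * 0.99; // 9,900 non-sci-fi books
const sciFiGalaxy = sciFi * 0.8; // 80 sci-fi "galaxy" books
const nonSciFiGalaxy = nonSciFi * 0.05; // 495 non-sci-fi "galaxy" books

// Of all the "galaxy" books, what fraction is sci-fi?
const probability = sciFiGalaxy / (sciFiGalaxy + nonSciFiGalaxy);
console.log(probability); // ~0.139, about 13.9%
```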
You might want to pause, go back, do this yourself on pencil and paper, but hopefully it kind of makes sense to you. Okay, I get it. We figured out the probability that a book with the word galaxy in the title is sci-fi, based on the scenario. We actually used Bayes' theorem here. I just never called it that, and I never used any mathematical notation. So, this video really isn't about the mathematical notation and the theorem itself, but let's double back and see how it actually is the same thing. So, I'm going to just move this completely off, and I'm going to write out Bayes' theorem for you. Okay, so first of all, what is this notation? This is standard probability notation. It's a little bit confusing. What it means is: given B, what is the probability of A? So, I might say, given I just heard thunder, what's the probability it's about to rain? B would be the event of thunder happening, and A rain. So, we're looking at event probabilities. You could sort of think of it backwards, but you could also read it left to right: what's the probability of A given the existence of B? So, in this case, B would be the word galaxy being in the title of a book, and we ask, what's the probability of A, that it's sci-fi? And Bayes' formula, well, it's equal to the probability of B given A, times the probability of A, I think. Yes, of course, A, divided by the probability of B. And this actually makes total sense, because that's basically what we just did here. We did it with actual numbers of books, but it works out to say: what is the probability that a sci-fi book has the word galaxy in it? Well, that was 80%. Times, what's the probability that it's a sci-fi book in the first place? Oh, that was 1%. Divided by... oh, we have something weird now. What's the probability that galaxy appears in the title of a book at all? Well, it's the sum of all the sci-fi books with the word galaxy and all the non-sci-fi galaxy books. That's what we did. But, we could make that a probability formula. It's the probability that a sci-fi book has the word galaxy in it, times the probability of sci-fi, plus the probability that a non-sci-fi book has the word galaxy in it, times the probability of non-sci-fi. That's all of the books, the non-sci-fi and the sci-fi total: what's the probability that any of them has the word galaxy in it? I believe if you do this math out now with Bayes' formula, you will also get 13.9%. So, in thinking about this, there's a really key term that's essential to Bayesian probability, and it's this idea of prior probability. The prior probability is what people will often forget about when trying to guess the probability of a given scenario. In this case, you might be thinking, well, 80% of the sci-fi books have the word galaxy, so it's just really likely. But, if there's such a small chance it's a sci-fi book in the first place, then it's kind of unlikely it's going to be sci-fi. There are other important terms here. This is usually referred to as the posterior. These are conditional probabilities, this syntax here: what's the probability of A given B? And this is sometimes referred to as the marginal probability, or I like to think of it as the evidence. So, I'm actually doing text classification here. What I'm saying is, I have a book, it has a lot of words in it. How can I classify this book as sci-fi or not sci-fi? And actually, saying sci-fi or not sci-fi is too limiting.
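Written as code, Bayes' formula with that expanded denominator looks like this; a quick sketch (the helper function is mine, not from the video's sketch):

```javascript
// P(A|B) = P(B|A) * P(A) / P(B), where the "evidence" is
// P(B) = P(B|A) * P(A) + P(B|notA) * P(notA)
function bayes(pBgivenA, pA, pBgivenNotA) {
  const pNotA = 1 - pA;
  const pB = pBgivenA * pA + pBgivenNotA * pNotA;
  return (pBgivenA * pA) / pB;
}

// P(sci-fi | "galaxy" in the title)
console.log(bayes(0.8, 0.01, 0.05)); // ~0.139 again
```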
I could say, okay, I have this book, it has all these words in it. Based on the frequency of all the words in the book, what's the likelihood it's a romance novel versus a thriller versus sci-fi? And we can actually use the word counts in a bunch of existing sci-fi books and existing romance books, calculate all those word frequencies, apply them with this formula to classify new text coming in, and that's what I'm going to build into the code. I could keep going with diagramming this out. I might have to come back to it as I'm writing the code, but let's stop there. The one thing though that I do want to mention is that what I will actually be
What is Naive Bayes?
implementing is known as naive Bayes. And what do I mean by that? Let's say instead of just having one word, I have "Long ago in a galaxy." The idea here is, forget about that being in a title. We're looking at all the words and the number of times they appear in a given book. What I'm trying to figure out is: given this text, what's the probability that this book is sci-fi? What I'm actually going to do is chop this up, whoops, long ago in a galaxy. I'm going to split it up into multiple words, and when I start to do things like, well, what's the probability that a sci-fi book contains the word long, I can chain multiple events together, right? If I want to know the likelihood that I'm going to flip heads on a coin twice in a row? Well, once is one out of two. Twice is one out of two times one out of two, one out of four. And I can do the same thing. In my Bayesian formula, where I'm trying to figure out which category this whole block of text is, I can chop it up into words and multiply the probability of each individual word appearing in that category together. So, I'm pausing because I never actually explain what makes the algorithm naive. The naive part is a big simplifying assumption. I'm going to treat all the words as a bag of words, and I'm going to assume that the individual probability of one word appearing in the text has no effect on the probability of another word appearing in the text. When in reality, those probabilities would be interconnected. That's why it's a naive assumption.
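A minimal sketch of that naive chaining, assuming per-word conditional probabilities are already available (priorProbability and wordProbability here are hypothetical stand-ins for functions built later in the video):

```javascript
// Naive Bayes: treat the words as independent and multiply their
// conditional probabilities together for a given category.
function scoreCategory(words, category) {
  let probability = priorProbability(category); // P(category)
  for (const word of words) {
    probability *= wordProbability(word, category); // P(word | category)
  }
  return probability; // proportional to P(category | words)
}

// e.g. compare scoreCategory(words, 'sci-fi') to scoreCategory(words, 'romance')
const words = 'Long ago in a galaxy'.toLowerCase().split(/\W+/);
```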
Setting up the Classifier in p5.js
Okay, in the spirit of me making a toy demonstration that just implements the algorithm and runs very quickly and easily in the browser, I'm going to hardcode in a very tiny amount of data. For this to really be interesting and to work in any meaningful way, I would need to load much larger amounts of data from text files. I might even want to move this to server-side code to manage the loading of all those files. I would be thrilled to come back and talk about that, or demonstrate it and include some examples in that direction. But, for now, I'm going to do it this way. The other thing I need is an empty JavaScript object which will store all of the various counts of all the words across the different categories. So, this is my empty JavaScript object that's going to have that in it. And now that I think of it, maybe I don't need these arrays to do this in such a hardcoded way, but I want to say something like classifier.train, and I'm going to get this sentence, "I am happy," from this array of positive data, and I'm going to train my classifier Bayesian model with the category positive. It's not an object yet, I just called a function, so I should make like a class. Maybe I will. Let's do this. My brain has moved 10 steps ahead. Let's put all of this into another file. I'm going to do all this in setup right now. And my classifier object needs to store a bunch of things. It maybe should have a list of all the categories and its unique words. Again, I might not use all these things. I'm just kind of literally doing this off the top of my head. And then, I need to have the word frequencies. Let's just start with this.
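Roughly the starting shape of the classifier, reconstructed as a sketch (the exact property names in the video's first pass shift a bit as he revises them):

```javascript
// classifier.js - initial skeleton of the classifier
class Classifier {
  constructor() {
    this.categories = []; // category names seen so far
    this.wordFrequencies = {}; // word counts, filled in by train()
  }

  train(text, category) {
    // split the text into words and tally counts (written next)
  }
}

// in sketch.js, inside setup():
const classifier = new Classifier();
classifier.train('I am happy', 'positive');
```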
Coding the train() function
Then, I'm going to have a function called train. It gets some text in. The first thing I'm going to do is split that text up into words. I have a whole set of videos about regular expressions you could watch, but essentially, I'm going to split by anything that is not a word character, letter, or number. And the way that I do that is backslash capital W plus, \W+. That should split everything up. And I also, do I just say lowercase, or toLowerCase? toLowerCase is the name of the method I want. I'm just going to make everything lowercase and not worry about that. Now, I'm going to go through all of these words. First, I need to check: have I counted this word before? If this.wordFrequencies[word] exists, then this.wordFrequencies[word].count++. This should be the count of the word across all the documents. And then, if it doesn't exist, this is fine, let's do this, then it's the first time. So, this is my standard word counting algorithm. And again, there are all sorts of other ways you could write this. I'm just trying to be verbose and simple about it. So, if I found this word before, increase its count. Otherwise, if I haven't found it before, start its count at one. Now, if I found it before, I also need to determine if this.wordFrequencies[word][category] exists, right? Because I need its total count across everything and its count per category. Again, one could be extrapolated from the other, but let's store everything right now. And I can't seem to spell frequencies. this.wordFrequencies[word][category]++, increase it. Otherwise, it equals one. And if I haven't found it before, I not only need to start the total count, but also start the count for that category. Again, redundancies and efficiencies aside, I can't wait to hear all of your comments about how to write this code better. I think this will work for my purposes. And again, I don't know that I actually need these separate lists of the unique words and the word frequencies. We'll get to that later. Okay. Now, I'm just very briefly putting in one sentence from each of those three hardcoded data sets into the train function, and then I'm going to look at the classifier, and I need to make sure I add classifier.js to my index.html file. And by the way, I'm using p5 2.0. I'm not using any features of p5 2.0 in this particular demonstration, but I'm starting to put that as my version of p5 in everything I do. And, you know, stay tuned and find all the videos on my channel about p5 2.0 that may or may not exist yet. Oh. Okay. I see. I can't just increment a property when the value of this.wordFrequencies[word] was originally undefined. So, what this actually needs to say is equals: that's me creating the count initially. And the same thing here, I could just put the value in there. I see. Do I want a count property, or do I just want the number? You know what? I don't think I need the property count. So, let's actually set it equal to one, and then there's no need for this count property. I'm just increasing it. That's what I did in the first place. I only used .count at the top. This is now correct. I could put a whole object there with the property count equal to one, but I'm just going to store the value one. I might regret that later. Category is not defined. Oh! No, I do need an object!
Cuz the category is then an object. Okay. So, this is now going to be an object with the count as one. So, count goes up. Hmm. Boy, am I doing this right? Should it be category, word? I don't even know. This might all be a mistake. This is why this video is going to be 2 hours long. Maybe to be consistent, I should do this, and then I should use .count here. We're going to use .count everywhere. So, now just to cover this again: this probably makes your head hurt, cuz it makes my head hurt. We're going to see the data structure that I've defined once it console logs correctly, and your head will hurt less. Oh my god. This whole time... Ah. I'm passing in the text and the category. There is no category. Where do I get the category? I get the category here. Now, we're getting somewhere. This .count should not be here. Okay, I've got it. Now, we can look at this. Look at word frequencies. Let's look at the word "am". It has a total count of two. In a positive text, it appeared once. And in a negative... All of that work was for me to separate all of the input data into a giant list of words, and to know the total number of times each word appears across all documents, as well as its individual counts for each category. Great. So, now that works, let's actually do this: for let data of positiveData, train positive; for let data of negativeData, train negative; and for let data of neutralData, train neutral. Again, this is very inefficient, but we're just going to do this. Let's run this again and take one more look. Okay, "the" is in positive once, negative once, neutral twice. "Awesome," by the way, is in positive once. It doesn't have a zero for the other categories, but that's fine. We'll be able to determine it's zero by knowing that it doesn't exist. Okay, now for the next thing.
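Paraphrasing the first-pass train() at this point as a sketch (a total count per word, plus a per-category count nested under each word):

```javascript
// First version: word -> { count, [category]: { count } }
train(text, category) {
  const words = text.toLowerCase().split(/\W+/);
  for (const word of words) {
    if (this.wordFrequencies[word]) {
      this.wordFrequencies[word].count++; // total across all documents
      if (this.wordFrequencies[word][category]) {
        this.wordFrequencies[word][category].count++;
      } else {
        this.wordFrequencies[word][category] = { count: 1 };
      }
    } else {
      // first time ever seeing this word
      this.wordFrequencies[word] = { count: 1, [category]: { count: 1 } };
    }
  }
}
```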
Coding the classify() Function
I need to classify a new phrase. Just looking at these, let's make something up. So, I need a function that's going to take in a new phrase and calculate that final probability for each of the categories, according to Bayes' theorem. So, what do I need to do? Okay, remember, I have some input text. It's got a whole bunch of different words in it. So, what I need to do is look at those words one at a time and apply Bayes' rule. I need to say, okay, if it's a positive category, how likely is that word, long, to appear? Times the prior probability for the positive category. We've got a new classify function. I need to chop the text up into words. I also need an object called results, and what this is eventually going to contain is the final probability for each category. So: given a category, how often does a given word appear? I'll call that the word probability. I need to keep track of some other things: the category counts. Let's do that separately. Hold on, I'm thinking through this. I need more information for my probabilities. I need to have the total number of documents. That's going to help me with my prior probability: what's the probability that a document is positive at all? And I need to have category counts, which is the total number of words in each category. No, wait. I do need the total number of words in each category. Let me map out what I need. I need to know totalDocs, the total number of documents. I need to know the total docs per category. That is what I need for my prior probability here, right? I need to know: what's the probability that any given document is a given category? So, that's what I need. Now, for words, I think I have what I need already. Ah, no, I need the total words per
Revising the train() function
category. That's what I'm missing. Right? Because for the words, I actually don't need the total word count. This is what's going to get me this: in order to know the probability that a given word is in a given category, I need to know its count per category, and then I need to know the total of all words per category. This is something I'm not storing. This is what I don't have in my code yet. So, I need to add that. Every time I call train, the total documents goes up by one. If categoryCounts[category] exists already, then, and I guess I'll do my .count again, it increases. This is me just counting the frequency of the categories across the documents. Otherwise, it starts with a count of one. So, this here, this is for the prior probability. Now, I'm keeping track of this. Now, I have the word's count per category, that's right here. I also have the word's total count, but that's not actually what I need. I need the category's total count. So, this should be reversed. I'm going to rewrite this with this new way of thinking. The first thing I need to check is: does the category exist? So, this is going to end up being the same code, I'm just rewriting it the other way around. Does the category exist? If it does, the total word count for that category goes up by one. Otherwise, it's a new counter that starts at one. Now, once we know the category exists, I have to check if that word has been found before in that category. If so, its count goes up by one. Otherwise, it gets a fresh count. What have I got wrong here? Oh. The issue is I was doing more stuff down here that I forgot about. The syntax errors are below. So let me comment that out. There we go. My code was right all along. No, no. No, the code is actually wrong. I've missed a critical detail. When a word appears for the first time in a new category, I make a new category property to count the total words of that category, but I don't count that first word. The result is that the first word for every category is lost. So, this is a very easy fix. It just needs to count that word as well, here in this else. And that's the corrected code. So, I don't actually fix this in the video. It works because I'm just off by one for a bunch of words, and I get values at the end of the video and never notice it, but the code that I released with this video, linked in the description, will have this fix in it. Let's just take a look at the classifier with the new way I'm organizing the data. So, now, looking under category counts, I have three positive documents, two negative documents, two neutral documents, seven total. That's right. Under word frequencies, each category is the root property. And under positive, there's the total number of words in positive, and "am" appears once. The probability of "am" appearing in the positive category is 1/15, which is exactly the math I need. Now, we're ready. I have the pieces. Just to reiterate, that's the math I need to know how frequently a given word appears in a given category. Both of these are the core probabilities that I need, and they map directly back to Bayes' formula. Oh, we are so close.
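A sketch of the revised train(), with the off-by-one fix he describes (counting the first word when a category's counter is created):

```javascript
// Revised version: categories at the root of wordFrequencies,
// plus categoryCounts and totalDocuments for the prior.
train(text, category) {
  this.totalDocuments++;
  if (this.categoryCounts[category]) {
    this.categoryCounts[category].count++; // documents per category
  } else {
    this.categoryCounts[category] = { count: 1 };
    this.categories.push(category);
  }

  const words = text.toLowerCase().split(/\W+/);
  for (const word of words) {
    if (this.wordFrequencies[category]) {
      this.wordFrequencies[category].count++; // total words in this category
    } else {
      // the fix: start at 1 so the first word isn't lost
      this.wordFrequencies[category] = { count: 1 };
    }
    if (this.wordFrequencies[category][word]) {
      this.wordFrequencies[category][word].count++;
    } else {
      this.wordFrequencies[category][word] = { count: 1 };
    }
  }
}
```

Note that storing the per-category total under the key `count` means a literal word "count" in the training text would collide with it, which is exactly the bug that surfaces at the end of the video.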
Implementing Probability Calculations
All right. For every word, I need the word probability. Ah, wait, wait, wait. I need to first do every category. I have all the categories in categoryCounts. So, I actually need to get this list of positive, negative, and neutral. Those are the keys of the categoryCounts property. This is very awkward. By the way, I've actually written a version of this that's, I think, much better, but I'm going to go with my messy style. We'll talk about how to refactor this later. Cue the refactors. So, categories equals... You know what? I have a better idea. Let's put this this.categories array back in, and then when there's a new category, I will just push it into that array. So, I'm going to store my own list of all the categories. Because now I first want to do: for every category of categories, results[category] equals some object. Now, I'm not going to worry about the correct way to do this. I'm just going to do it in a way that will work. It will be redundant. Cuz now I'm going to iterate over all of the words. We need to multiply them together, so we're actually going to start with a word probability of one. And then we're going to say, the word probability equals... So, what do I need? I need this.wordFrequencies for this category, for this word, divided by this.categoryCounts count... All right, let's do it this way. Let wordCount equal the number of times this word appeared in that category. Let categoryCount equal the total number of words in that category. And the probability, and I'm going to chain them all together, is the word count divided by the category count. That is the numerator of Bayes' theorem. Actually, it's not the whole numerator, because I also need to multiply by the prior probability, but I'm going to do that at the end. I'm just chaining all the word probabilities. I don't like this name: categoryWordCount. So, the word count, the number of times this word is in the category, versus the total number of words in that category. So, this is P of B given A, essentially. Then I need to multiply it by my prior probability of A, which is the total category count for that category divided by the total number of documents, which is just this.totalDocuments. Naive Bayes: multiplying all the probabilities together. Get all the words, look at each category in the results, look at all the words for each category, times the prior probability. Why can I not get my brackets right? Oh, there are more brackets down there. That's why. So, now the results for this given category equal the word probability. I haven't done the denominator yet. Earlier I talked about how I'm going to do something different with that denominator, and you'll see that in a second. So, I'm ignoring the denominator, but I've now calculated the numerator for a given category. Okay, this is done. Let's look at the results. Let category of categories... What's wrong here? this.category... Oh, yeah. There are going to be words that don't appear in a category. Well, I'm going to need to apply something else here.
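Here's that numerator loop as it stands at this point, paraphrased as a sketch (no smoothing yet, so an unseen word still breaks it):

```javascript
classify(text) {
  const words = text.toLowerCase().split(/\W+/);
  const results = {};
  for (const category of this.categories) {
    let wordProbability = 1;
    for (const word of words) {
      // breaks when a word never appeared in this category (fixed next)
      const wordCount = this.wordFrequencies[category][word].count;
      const categoryWordCount = this.wordFrequencies[category].count;
      wordProbability *= wordCount / categoryWordCount; // P(word | category)
    }
    // prior: P(category) = documents in category / total documents
    const prior = this.categoryCounts[category].count / this.totalDocuments;
    results[category] = wordProbability * prior; // Bayes' numerator
  }
  return results;
}
```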
Laplacian (Additive) Smoothing
It's called Laplacian smoothing. Basically, what I want to do is consider every word as appearing at least once. So, I'm saying here, wordCount equals wordFrequencies[category][word].count, or one. And I've got to do a little bit better than this in a bit. All right, I'm going to debug this properly, don't worry. I'm going to comment all this stuff out for a second. Let's just see, does this work? Yes. Okay, I'm getting these frequencies for every category. Ah, cuz the word is undefined. There's no count. I see, this is what's undefined. So, I'm just trying to think. I mean, I obviously can use an if statement there, but I'm trying to avoid it. Fine, I'm going to use an if statement. You're watching this video, you were promised a working thing, I'm going to finish it. If this is undefined, the word count is going to be assumed to be one. And if this is defined, you're going to see, I'm going to do some crazy stuff here, but it's fine. What I'm going to do is say wordCount += its actual count. I'm sure, again, there is a better way to do it. I think we can safely assume every category has words in it. So, this is a form of Laplacian smoothing, which I will try to define in a better way in a little bit, but bear with me. What I want to do here is basically this: I've got the words coming in, and I can't have zero, cuz zero is going to break everything in all my math. So, I'm going to give a word the smallest amount of probability if I haven't seen it before in this category. Let's just count it once. And then, if it actually has been seen before, I'm going to add in the actual count. And the reason why I'm adding in the actual count instead of just overriding it is cuz I want to keep that one: if I'm adding one to things that were zero, I'm going to add one to everything else, too. So, this should still work. Now, the Laplacian smoothing needs another step to it, but I think this might fix all of the things that broke. Okay, better? totalDocuments is not defined. No problem. Okay. We've got this. And when we get to the end, return results. We've got to return the results. Let's go to sketch.js. Console.log those results. What's the other console.log that's happening that I don't want to see now? Okay, here we go. Ah, look at this. I got scores. I got it. 42% positive, 28% negative, 28% neutral. Wait, those don't add up to 100%. Okay, I haven't done the denominator yet, but I also haven't finished the Laplacian smoothing. So, what I'm doing is, I'm saying I have the word happy, and I have the happy count plus one. And the probability happy appears in a given category is that divided by the total words in that category. But if I'm assuming every word appeared in that category one more time than it actually did, then the total words is not correct. I have to also add the total unique words. Cutting in here to clarify a subtle distinction. I've actually said the correct thing here: I need to add the total unique words, in other words, the total vocabulary size of all the words that appeared across all the categories. However, in a minute, when I go to implement this in the code, for whatever reason, I think that I just need to add the unique words for that category, which isn't correct and doesn't make any sense. So, the code that I write will be slightly incorrect and a little more complicated than it needs to be. I'll come back again to show you where the mistake is in the code.
And of course, whatever code I release will have it fixed. Basically, I've added one to everything when I'm calculating the probability, so I just need to increase the total to reflect that. This categoryWordCount here needs... I need to add the total number of unique words. So, let's see here. When do I know I've found a unique word? The moment I add that word to the category. Where did I say I was going to add it? categoryCounts... uniqueWords. Positive category unique words. So, I could say this.categoryCounts[category].uniqueWords++. However, if it never encountered it before... ah, this is it here. I think the category counts already exist, they already exist up here. That's the first time I found a word for that category. I think this is right. No, sorry, it's not correct. I mean, it is correct in the sense that I am counting the number of unique words per category, but that's not what I need. What I need to do is actually much simpler. I can remove this code from here, and I can just add a property which is the vocabulary, for every word across all categories. That can be an empty object, and then anytime I find a word, I can just insert it into that object. And later, you're going to see that if I need to find the vocabulary size, I can just look at the length of an array of all the keys inside of that vocabulary object. Let's just look and see if this makes sense. Forget about the results for a second. Let's just look at the classifier. Okay, I've got three categories: positive, negative, neutral. For each category, I have the total number of documents and the total number of unique words. So, these category counts and word frequencies do not need to be different objects, I'm realizing, because look, they have the same initial properties, and then they have count and all the words. So, this could be consolidated, but I'm going to leave it separate for now. Here, I have the total number of words across the category, as well as the total number of times each word appears. Great. So, I have everything I need. The only thing I need to add now is, when I do this final calculation, I must have taken out the original word probability calculation, which is now equal to the word count divided by the category word count. Now, the category word count needs the smoothing. The word count has the smoothing, by adding one to everything. The category word count needs the smoothing by adding the total unique words, which is in categoryCounts[category].uniqueWords. And this is the last part of the error regarding Laplacian smoothing, which I should mention is also called additive smoothing. I'm adding the unique words for that category, but what I actually have, and let's just make a variable called vocabularySize, is equal to the length of all the keys of this.vocabulary. And this is what I want to add to the category's word count. Again, you won't see this reflected in the rest of the video, but it will be in the corrected code that's linked from the description. Now, we've got it. We've got the numerator, which is *=, chaining together all of the probabilities of each word appearing in a given category with Laplacian smoothing, assuming if it doesn't exist, let's give it a count of one, divided by the total number of words that ever appear in that category, with the smoothing.
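The corrected smoothed word probability, as described here, sketched out (add 1 to every count, and add the full vocabulary size to the denominator):

```javascript
// Laplacian (additive) smoothing: pretend every word in the vocabulary
// appeared once more than it actually did in each category.
const vocabularySize = Object.keys(this.vocabulary).length;
const entry = this.wordFrequencies[category][word];
const wordCount = (entry ? entry.count : 0) + 1;
const categoryWordCount = this.wordFrequencies[category].count + vocabularySize;
const wordProbability = wordCount / categoryWordCount; // smoothed P(word | category)
```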
Then, we multiply that by the prior probability and we have the final results, the final probabilities. Without the denominator, that's the last piece. Let's go back and look and see if the results make sense for what a fantastic and happy day. Okay. Well, these numbers are kind of impossible for us to look at now because they got much smaller probably with the smoothing that I added in. So, it's hard for me to look at these, but don't worry, the denominator is going to fix that. Let's go back to the
Ignoring the Denominator (Normalization)
whiteboard for the very last time and talk about the denominator. I'm going to write this formula one last time. The probability that a given text is a given category: what's the probability this text is positive or neutral or negative? It equals the probability that that category contains that text, times the prior probability of that category existing at all, divided by the probability of that text existing at all. Well, guess what? I want to do this calculation for the probability of neutral, the probability of positive, the probability of negative. For everything that we're calculating, this number and this number are going to be different depending on which category we pick. But, this denominator is just the probability of this text appearing at all, across everything. And so, yes, we could calculate that. We could count up all the words in this particular text that we're looking to evaluate and divide by the total number of words in everything we've ever looked at, but this is just a constant. It's a constant for this particular piece of text that we're trying to evaluate. So, you know what? We can ignore it. And instead, we can just normalize the values that we're getting out of the numerator, cuz that's the point. We're going to have to do that anyway. We're getting these scores, which we want to normalize to turn into 80%, 15%, 5%, numbers that add up to 100%. If the denominator is going to be constant across all of them, we don't need to bother with it. That's what I was thinking about all along this whole video: I'm going to get to skip that. So, that's the exciting thing. So, we're kind of done. All we need to do now is normalize these scores. So, the very last thing we're going to do, before we return the results, is normalize the probabilities. For let category of the categories, I want the probability for that category, and I want a sum starting at zero. I'm going to sum up all those probabilities, and then guess what I'm going to do: just divide by that sum. Again, you could use reduce or map or whatever higher-order, one-line arrow function way of doing that. Can't wait to see all of your much better code doing this, but I'm just going to run through this loop again and say results[category] divided by that sum, and now return the results. So, this is me very quickly and easily normalizing those values. We're going to run this one more time. Ah, what did I get wrong? this.categories... I think that's it. I really should not be using the p5 web editor for this. There we go. 85% positive, 12% negative, 2% neutral. Let's now put something on the actual webpage.
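The normalization step at the end of classify(), sketched out:

```javascript
// Skip Bayes' denominator (it's constant across categories) and
// normalize the raw numerator scores so they sum to 1 instead.
let sum = 0;
for (const category of this.categories) {
  sum += results[category];
}
for (const category of this.categories) {
  results[category] /= sum;
}
return results; // e.g. roughly { positive: 0.85, negative: 0.12, neutral: 0.02 }
```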
Quick User Interface
So, I'm going to say let textInput = createInput(). We'll put this in there. I'm going to do submitButton = createButton(), and let's put this stuff after we train the classifier. Then, I'm going to have submitButton.mousePressed and classifyText, and we're going to make another function. I'm going to do this in the most verbose, friendly, simple way. This is my event: when I press the submit button, I'm going to let the text equal the text input's value, and then the results is the classifier classifying that value. Let's just console.log them first. No, let's not console.log them, let's put them on the screen. And, this is so silly, but I'm going to make these global variables just to clean this up. Let classifier... I mean, again, there are all sorts of other ways to do this. textInput, submitButton, and then the results paragraph. These are all my global variables. So, this is now a global variable, so I have access to it: the text input, the submit button. And now I'm going to get the value, I'm going to put in the results, and then I'm going to say... oh, the results is an object. So, let categories equal... now, I'm finally using Object.keys. I refused to use this the whole way through, but if I have an object with like three properties in it, positive, negative, neutral, I can get them into an array by doing Object.keys(results). And then I'm going to say the output is an empty string, and for let c of categories, output += ... I need some text here. I need results[c]... I need the category. Let's use a template literal: c, colon, and then the results for that category, and I'm going to multiply it by 100 and number format it. Look at all this I'm doing now. Wait, I'm going to number format it, multiplying by 100, to two decimal places. Who knows if what I did was right here, but this should be getting my text input, getting the results from my Bayesian classifier, getting the list of categories, iterating over them, putting them into... what did I call it? It's resultP, the result paragraph. resultP.html, put that output there. Okay. Ready, everybody? Let's run this sketch. How do I make this longer? A text area... fine, createElement, textarea. Will this do it? Okay, that worked. Okay, fine. I'm happy now. Are you happy? Submit. Uh, what went wrong? Oh, I never made it. resultP = createP, results here. Okay, let's try this again. Look at this interface design, people. I mean, come on. Does it get any better than that? Oh. Okay, this is good. Why did my number format not work? And I forgot... so, we need a line break. That's better. But number format, 2, 2, do I need two things? There we go. Yep. Pretty negative, I'd say. Coding challenge complete!
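The quick interface, reconstructed as a p5.js sketch (element labels and the training calls are abbreviated):

```javascript
let classifier, textInput, submitButton, resultP;

function setup() {
  noCanvas();
  classifier = new Classifier();
  // ...train the classifier on the hardcoded data here...
  textInput = createElement('textarea'); // a textarea instead of createInput() for more room
  submitButton = createButton('submit');
  submitButton.mousePressed(classifyText);
  resultP = createP('results here');
}

function classifyText() {
  const results = classifier.classify(textInput.value());
  let output = '';
  for (const c of Object.keys(results)) {
    // nf() is p5's number formatter: 2 digits left, 2 right of the decimal
    output += `${c}: ${nf(results[c] * 100, 2, 2)}%<br>`;
  }
  resultP.html(output);
}
```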
Final thoughts and next steps.
This might be the worst code I've ever written in a lot of ways, but I did explain it, and I did get it to work. Let's take a deep breath. I'm not going to spend any more time in this video than I need to right now, but I should probably go to github.com/shiffman... positive stories... the Bayes classifier. Let's take a look at this. Okay, this is my nicer version. Let's have a little walk through of what's in here. Okay, word counts per category. Ah, that makes sense: you have the word, and then each category has a count associated with it. That's one way. Category: the document count, how many texts for this category, and the total words across all texts for this category. There we go. Total number of documents, vocabulary size. So that's something I kept as a separate property, as opposed to some kind of unique words count. I just kept track of the total vocabulary size. That makes more sense. So when we get... ah, look at this. There's an addWord function, a train function, a wordProbability function, and a categoryProbability function. Separating these things out into separate functions makes a lot of sense, cuz we can have all this code to just calculate a word probability for a category, and we can have all this code for when a new word is being added to a category. So we can check: does that word exist? If it doesn't, it's a new word, increase the vocabulary size. Initialize the count for all those words, initialize it for the category, then the total word count. Aha, all this stuff. Train creates, and this is better, this sort of empty document count and word count. You can see it splits the text. Oh, I'm validating the words to make sure they're actually like a word, as opposed to just some string of numbers I might want to ignore. That's interesting. Word probability, it's got the smoothing. Here's some explanation of it. Look at all this. So I encourage you, if you somehow made it all the way to the end of this video, you've seen the raw struggle of a man trying to make a Bayesian text classifier with only a very small breakfast and hours of time. This is a much more thoughtfully written version of it. It is also my intention to turn this into a library that maybe you could use in node.js, and I might come back and use that in a future video if there's any interest. Okay, before we go, I've got to do one more thing. I've got to put something in this particular sketch with the word "count" in it. Now, is this going to break it? Yes. It broke. I don't know how to fix this... well, the way to fix it is to use my other code where I didn't use the count properties. I guess what we could do is rename all of these. Let's do this, a quick way to deal with it, not my favorite: find .count, replace with ._count. Won't that work? And then this has to be _count. I did it in other places, too. Not a number. Line 17, I missed it. We're back. We did it. Thank you, everybody. See you next time on the coding train.
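For reference, the refactored version he walks through separates those concerns into small methods; an outline of that shape, with bodies abbreviated (names follow the walkthrough, details are approximate):

```javascript
class BayesClassifier {
  constructor() {
    this.wordCountsPerCategory = {}; // word -> per-category counts
    this.categoryCounts = {}; // category -> { documentCount, totalWords }
    this.totalDocuments = 0;
    this.vocabularySize = 0; // kept as its own property
  }

  addWord(word, category) {
    // if the word is new, increase vocabularySize;
    // initialize and bump its count for this category and the total word count
  }

  train(text, category) {
    // create empty document/word counts for a new category,
    // split and validate the words, addWord() each one, count the document
  }

  wordProbability(word, category) {
    // smoothed P(word | category): (count + 1) / (totalWords + vocabularySize)
  }

  categoryProbability(category) {
    // prior P(category): documentCount / totalDocuments
  }

  classify(text) {
    // multiply wordProbability over the words, times the prior,
    // then normalize the scores across categories
  }
}
```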