# Upcoming Plans, ML Sampling with softmax and temperature

## Metadata

- **Channel:** The Coding Train
- **YouTube:** https://www.youtube.com/watch?v=ZfHVlMqJIxY
- **Views:** 6,995

## Description

🚂 Website: https://thecodingtrain.com/
👾 Share Your Creation! https://thecodingtrain.com/guides/passenger-showcase-guide
🚩 Suggest Topics: https://github.com/CodingTrain/Suggestion-Box
💡 GitHub: https://github.com/CodingTrain
💬 Discord: https://thecodingtrain.com/discord
💖 Membership: http://youtube.com/thecodingtrain/join
🛒 Store: https://standard.tv/codingtrain
🖋️ Twitter: https://twitter.com/thecodingtrain
📸 Instagram: https://www.instagram.com/the.coding.train/

🎥 https://www.youtube.com/playlist?list=PLRqwX-V7Uu6ZiZxtDDRCi6uhfTH4FilpH
🎥 https://www.youtube.com/playlist?list=PLRqwX-V7Uu6Zy51Q-x9tMWIv9cueOFTFA

🔗 p5.js: https://p5js.org
🔗 p5.js Web Editor: https://editor.p5js.org/
🔗 Processing: https://processing.org

📄 Code of Conduct: https://github.com/CodingTrain/Code-of-Conduct

## Contents

### [0:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY) Segment 1 (00:00 - 05:00)

Happy birthday. All right. Uh, quick audio check here. Hopefully you can hear me. Okay, I will be popping back in just a couple minutes. Getting set up.

### [5:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=300s) Segment 2 (05:00 - 10:00)

Hi everybody. Happy, I don't know, to say good morning, good afternoon. It's 12:05 p.m. Eastern time here in the state of New York in these United States, and this is my weekly Monday stream of the Coding Train. Nice to see all of you. Um, I don't really know entirely what I'm going to do today. I would love to hear from those of you in the chat, uh, where you're watching from. That's always something that I enjoy. I am my third week in now into the fall where I have been streaming every week. Um, the goal of these streams is to cover a variety of different topics, and then, uh, later on in the week and into the next week, as I will explain in a second, uh, various parts of the stream will then get edited and released as, uh, regular uploads onto the Coding Train YouTube channel, which is the channel that you're watching.

So first of all, let me introduce myself. My name is Dan Shiffman. I have been, uh, doing this Coding Train thing on YouTube for well over 10 years, and it's kind of ebbed and flowed, and I don't really know where it is right now, to be totally frank. Although I'm not Frank, as I said, my name is Dan. Um, but I'm trying to figure that out this fall. So the last couple streams, the first stream was an extreme mess. Um, and I covered Bayesian text classification. There is an edited version of that which will come out sometime, I think, in the next couple weeks. Again, I mentioned this last week, if anybody is, um, uh, I'm sorry, I'm trying to keep an eye on the chat, um, as well. Oh, uh, hello to many of you. Uh, special shout out to, oh, uh, Deepom, who has just given me an extraordinary amount of help in the last couple weeks in the Discord, uh, answering questions and reviewing, uh, various things that I've been covering. Uh, what was I saying? So, uh, the first week, Bayesian text classification. Oh yes, if anyone has any particular expertise or watched that live stream and is especially curious, uh, please pop into the Coding Train Discord if you would like to help give feedback on that edited version, fact check it. I see we've got Xenova again in the chat, which might lean me towards covering some transformers.js today.

Um, okay. So, Bayesian text classification, that was the stream from two weeks ago. Last week, I did a double stream, which is what I would like to do on Mondays. I couldn't get, I don't think that's, uh, it's already noon and I have a lot going on. I had a, um, busy family weekend, which caused me to get into the location where I'm doing these streams later, and I'm, uh, more unprepared than usual. Not that I've been very prepared the last couple weeks, but I'm really just walking in here turning the lights on. And so, um, I'm a little bit more out of sorts. So, last week, though. Okay. What did I cover last week? Oh, you're not even seeing anything on my screen, which doesn't really matter right now. But let's go. Let's wander over to Ye Old Coding Train YouTube channel, who apparently is live right now. I'm not signed into YouTube here, so this is just whatever. So this is what 84 of you are watching, which is kind of a small number, but totally fine. That's, it's intimate. This is an intimate setting. Just you and me here, which is good. Takes the pressure off here. Uh, but if I go here to live, I'm just curious here. Yes. Uh, we had, ah, last week I covered, uh, transformers.js, uh, as well as p5 2.0 async and await. Kind of figuring out how to title these. I'm just using, like, default thumbnails. Could really use some help or ideas about how to best archive them.
Although in the end, I think these are really meant to be watched in the moment. And if you're interested in the topic, you're probably better off watching the edited versions that will come out later. So maybe putting any time and energy into archiving these is not entirely necessary. Here is the Bayesian text classification. And I did have some of these, like, streams over the summer that I do think, these two in particular, are worth going back and, uh, watching if you're interested. Okay, so that's what's been happening. Now, let me make you, uh, oh, no, that's not where I'm going here. Let's go over here. Let's make a little list here. So, um, edited videos. Hold on. Let's zoom out a little bit. It's a little bit too thick. Okay, hold on. Uh, maybe I'm already, okay.

### [10:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=600s) Segment 3 (10:00 - 15:00)

Okay. What's coming out, edited-video-wise? So, today, it's already ready, I could even drop the link to the unlisted version of this video in the chat: the p5 2.0 async and await video is coming out. And then, um, there is also another, uh, async await video. I still have this problem with this pen. So those are in the pipeline. This one is ready. This one is almost ready. We've got, I believe, intro to, and I'm just going to write TF.js these days. I think if you hear me or see the label TF.js, I am almost definitely referring to transformers.js. Maybe there's a better abbreviation. This is a very confusing abbreviation because there also is a library called TensorFlow.js, which could be abbreviated with TF and also happens to be a machine learning library. So there's that, and then there is the, uh, LLM, uh, video, and then the Bayesian text classification. So I think one thing just to sort of mention here, I'm just kind of being accountable to myself and trying to figure out this system, is that I've done two weeks of live streams with all these topics, none of which have edited videos actually out on the channel yet. So, a lot of that has to do with figuring out the processes of how to deal with the files and get them uploaded, and Matia's working on them. So, um, this is something that Matia and I are collaborating and thinking about, but it is kind of also a reason to just have a little bit of a, let me slow down and do a bit of a planning stream, do some Q&A, cover some random stuff today, to allow us some time to kind of catch up on these videos. So, stay tuned for that.

Uh, okay, coming back over here, checking the chat. Um, I'm looking at these wonderful, uh, people coming from Indonesia, and I'm scrolling back and I see India and Saudi Arabia, the Netherlands. Uh, amazing. So wonderful to be here on the internet with all of you today. I'm trying also to just have a little bit more of a calm energy, because I think sometimes I get to these streams and I'm like, I'm raring to go, and then I, like, blast through all this energy in the first 10 or 15 minutes, and, like, uh, these are kind of long, so I need to sort of save myself. Um, and this makes me very happy. Um, this is kind of what I'm going for, uh, this comment which says Mondays are now becoming exciting because of your live stream. Here's also just a small, uh, little note: I really should just schedule these, put them on YouTube and have them scheduled, but what happens to me is, I'm not sure what time is going to work right for me, because sometimes I have, like, a quick Zoom meeting I have to do or so. So, I end up not scheduling these until, like, Monday morning, and I just, like, throw it up on YouTube. I don't know. Uh, Kev says, "Hey, I got an idea. How about coding today?" All right. So, this is a very important comment, and I would like to do some coding today. So, let's go back over to the whiteboard, and I would like to kind of tell you a bit about my thinking about the topics I'm hoping to do. I was going to say in the next couple weeks, but really I'm kind of talking about this semester. Uh, my current kind of plan for the Coding Train is to treat it as something very seasonal. So we're going to do a fall semester, spring semester, there's probably going to be, like, a winter break. Uh, so one thing that I discovered in doing this is it's much better for me to commit to some time and kind of have that planned.
The lawnmowers have begun. I'm going to talk about the lawn mower in a second. Oh boy. So funny. So funny to me. Okay. But it's, what I'm, it's much better for me to commit, um, and then have planned times where I'm not streaming or recording, than to just kind of always feel like I should be doing it and trying to fit it in and squeeze it in. So right now, every Monday, I will say that most likely, so next Monday I'll

### [15:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=900s) Segment 4 (15:00 - 20:00)

be here. I think it's, uh, whatever is, um, the holiday here in the US, Indigenous Peoples' Day, a Monday. I think it's the 13th, the 12th, 11th, I can't remember. I think I probably won't be here that day. Uh, it's a holiday. My kids don't have school, and I don't know what the family plan will be. So, uh, there will be some Mondays that I miss, maybe Thanksgiving week. I don't know. I was sick, actually, the week that I did the Bayesian text classification, which maybe is my excuse, one of my excuses, for why that didn't go very well. Okay. Um, all right. So, but let's talk this through. So, um, there's a couple different, there's a couple categories here. Here are my categories: p5 2.0, p5 beginner, and then what I would say is ML. And again, I'm allergic to the letters AI, even though that's kind of what I mean, but I want to really focus on it, uh, being machine learning. So, these are kind of what I'm thinking right now. p5 2.0: we've covered async and await. That's done. Uh, I want to cover, uh, probably should cover variable fonts, and, uh, font, uh, I don't know what to call this, the outline of a font, the sort of, like, vertices of the letterform, font geometry, maybe, I would say. Uh, I want to cover custom shape changes, in particular splines. So, that's my list for p5 2.0.

Um, okay. So, there's a loud lawn mower mowing around here. I am, uh, at a residence, which happens to be my residence, but things are kind of complicated because I'm in the process of moving. Um, and, uh, I don't mow my own lawn, as might shock you here. And, um, the mowers that I have hired, they come on Mondays. And I tried to reschedule it, but there was no other day. But then I sort of thought the plan was to have them come, like, really early in the morning. So that's why also I delayed a little bit today, but then they weren't here. I was like, "Oh, maybe I'm confused." Anyway, of course, Murphy's law. Whenever I start streaming, the lawn mower starts. There's also, I don't know if there's, like, a hive somewhere hiding. I'm in a garage, in this little attic crawl space above, but there's, like, three or four bees. No, one, two, three, four, five. And they're all by the sort of, like, window over there, trying to get out. So, I gotta figure that out. This space, uh, was not meant to be my recording space this fall, but life gets in the way, and I pressed the wrong button again. Okay, that's enough of these personal tidbits. Okay, but this is good. This planning stuff is fine while the lawn mower finishes.

Okay, p5 beginner. This is really, um, I'm just going to make the list: basics, variation, uh, conditionals, uh, loops, arrays, functions, this is my new kind of order, objects. So, um, because there's a new p5 2.0, I want to redo all of my beginner tutorials. However, if I were to mark some of them, maybe with a highlighter. Let's try this highlighter. I haven't tried this before. Uh, and actually, let's go over to the computer for a second, because this will help us. Um, the thing about the lawn mower is it causes me to speak much more loudly, because I think I've got to shout over it. I don't think you all hear it. I think I might have noise reduction. Uh, I know the noise reduction is getting rid of it. It's more just an issue for me. Um, okay. So if I go to the Coding Train website and I go to, uh, tracks, and go to here, this is what I'm talking about, which is redoing this series.
But if you look at it, um, you know, this needs a refresh because the web editor's changed a bit, but it's really fine. This kind of is mostly fine. It's using the web editor. Fine. All this stuff still applies once I get to here. These I kind of, I redid. I don't know about these thumbnails. Um, although I kind of actually really like them. Random, map. Oh, ancient video. See, here we go. The map video. So to be honest, like, what's the point of redoing that? The concept of map and everything about

### [20:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=1200s) Segment 5 (20:00 - 25:00)

it has not changed. So that's not too high a priority. I don't know why I create graphics in here, but I always feel like it comes up. So I made this extra video and put it here. But let's look here. Aha, here we go. I really feel like these should be redone. They're the oldest. Uh, so redoing those might make the most sense. Um, let's go back over here and see if my highlighter works. So, under, uh, variation, conditionals. Oh, no. Variation, we decided, was not as big of an issue. So, it's conditionals, I know loops, and I know arrays. And then, um, and actually, functions. So these are the ones that are the most old, because I redid objects whenever I decided, oh, ES6 classes is a thing. And then I redid the basics and variation a number of years ago. But I do think, there's kind of, like, I don't know what pen to use here. I'm kind of overdoing this. Uh, I do, like, I teach arrays and functions quite differently now. And I think the video on functions uses, like, an object literal, and it's kind of confusing. It doesn't follow this trajectory really well. So these would be, like, sort of priorities. So, I kind of have this question in my mind of, maybe I should just, if I could somehow have just, like, a few days free, I could just run through all of it in, like, two days streaming and then cut them up, or I could just take the approach of, like, slowly over time, out of order, let me redo a bunch of these topics. Um, and certainly, I'm not looking at the chat right now, but if there's any teachers or people who use my videos as part of their curriculum watching in the moment or later, please reach out to me. I would love to know if a certain way of doing this would be most helpful for you. But what I also have a little bit of a mental block on is, I would like to, um, let me just get the black again here, I would like to do a new intro which covers history and then kind of talks about, like, why maybe you still want to learn to code in the age of AI. So, I really want to make that video. I actually wrote a script for it just as an exercise. I mean, it's a total mess. I couldn't just do it right now. Um, maybe it's worth talking through some of the ideas there in a live stream, but that's kind of my plan.

Uh, okay, so these are things now, uh, under the, um, ML topic. There are some things, mostly, you know, I want to continue with TF.js, meaning transformers.js. Um, I would love to cover, uh, depth estimation, image segmentation, just making a list of things on my mind. Uh, I want to look at, uh, how you might do speech to text and text to speech. I can't decide if those should be, like, separate tutorials, or one longer, like, coding challenge video where we just make a conversational chatbot. But let me just write these out: text to speech, speech to text. Um, actually, hold on. Let's save myself some room. So those are things I want to cover. Um, actually, and this is a little bit separate, and this is what I was thinking of doing today: I wanted to cover, um, softmax, sampling, and temperature. This is my idea for doing kind of a separate tutorial. So this is a little confusing here, but my idea for this is, uh, one, I have these Markov chain examples that I've been using for years, and I'm also, as you saw earlier, wandering into looking at, uh, LLMs. Oh, and I have a question for Xenova, if you're actually still watching, or if you just heard your name and popped back in.
Um, uh, but, you know, I thought that, like, a separate video that just looks at: I have an array of, uh, things, and each thing has a score. So, what is the math for softmax that allows me, and you might not know at all what I'm talking about, the point of this, I would explain this, but softmax is a mathematical function that gets used with neural networks in the last layer

### [25:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=1500s) Segment 6 (25:00 - 30:00)

before the output is picked, um, where even for, like, image classification, you're just turning all these, like, positive and negative arbitrary score values into probabilities that all add up to 100%. So, softmax does that. I've already actually made videos about how, if you have a list of things each with a probability, you pick one randomly based on the probabilities. But what I haven't looked at is how you can then add this temperature variable, um, to either stretch the probabilities so that the more likely ones are even more likely, or, like, flatten them, which could make your system a bit more chaotic. And that's the idea of temperature. Uh, wait, turning the temperature up, it's actually, turning the temperature up flattens the probabilities. It's a little counterintuitive because you're making it hot, but the reason why it's hot is because you're picking crazy stuff that's normally very unlikely. So, that's actually what I was thinking of covering today.

Uh, and I think this is mostly, I mean, I could keep going here, because I want to do, like, object detection. Object detection is one. Uh, image captioning. There's just so many things. Xenova could give us a list of all the most exciting things to, uh, try, the latest and greatest stuff. Oh my goodness, 100%, this semester. I totally forgot, this actually is, like, uh, embeddings, and this is something I could actually do, and, like, similarity, sentence similarity, all sorts of, um, kind of stuff there. So I think this is my map of what I'm looking to do this fall. Incredibly organized here. Let me come back to the chat and get some feedback. One nice thing, I set up a new scene here, so I can actually bring, this is my map, and I'm just standing in front of it. So, let me leave this here and see if I could get some, um, uh, feedback or thoughts.

Uh, Xenova says, oh, yeah, oh, look at this. So, I made this new scene, and so once I put the comment thing on, it's just, like, this giant comment here. Let me fix that. Uh, while I'm here, the other thing is, like, I really want to improve my setup, but okay. Uh, I don't know which topic you meant was, uh, useful. Um, and, ah, okay, hold on. I like that Xenova is giving me positive feedback here on this. And now, I like this comment as well. Let's bring this up: combining it with a beautiful visualization that has a temperature parameter slider would make it really interesting for students. Yeah. Well, I'll show you the example that I have. Um, so first of all, I don't think you're very likely to get a beautiful visualization out of me. You might get a visualization. What I do like about this is, I'm reading the comments at the same time that I'm thinking, which never works very well for me, and I already forgot what I was saying. Uh, oh. One of the things that I do here on the Coding Train is, when I make a video and it ends up on my website, then people can submit their own versions of it. So, the perfect thing is for me to explain it, show a sort of barebones explanation and demonstration, and then people can make their own versions of it that have more interactivity, more visualization, that kind of thing. Um, okay. Um, uh, so many interesting, uh, um, thoughts here. All right, so let's see if I can get a poll going. So, you know, maybe that's what I should do. Maybe I should work on this idea of softmax, sampling, and temperature. And in fact, we could then tie it back to the LLM example.
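Here's roughly what that math looks like in code, a minimal sketch in plain JavaScript (the names here are illustrative, not from any particular library):

```js
// Softmax: turn arbitrary scores (logits) into probabilities that sum
// to 1. Dividing the scores by the temperature first is what stretches
// or flattens the result: T > 1 flattens, T < 1 sharpens.
function softmax(scores, temperature = 1) {
  // Subtracting the max score keeps Math.exp from overflowing.
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp((s - maxScore) / temperature));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

const scores = [2.0, 1.0, -1.0];
console.log(softmax(scores, 1.0)); // ≈ [0.71, 0.26, 0.04]
console.log(softmax(scores, 2.0)); // ≈ [0.55, 0.33, 0.12] — flatter, more chaotic
console.log(softmax(scores, 0.5)); // ≈ [0.88, 0.12, 0.00] — sharper, more predictable
```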
So, and this is where I had a question for Xenova, let me bring that question up. Steve Mould has a video with snake with water in p5. Tell me more about this, but later in the Discord, maybe, because I don't want to get too distracted now, but if Steve Mould is using p5, I'm, like, a huge Steve Mould fan. I've got to check that out. Okay. Um, oh, the other thing is, I could continue, I forgot, I could continue the lesson from last week, um, on, um, oh, uh, the LLM video that I made that just used SmolLM, I don't

### [30:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=1800s) Segment 7 (30:00 - 35:00)

know how to say it. Um, I could do another video where I actually explicitly show how you add streaming and a conversation history. So that might be worth actually doing just as, like, a part three, two, whatever, however I'm counting that stuff. Um, so maybe I should focus on that, because then I could come back and explain softmax, sampling, and temperature, which leads me to my question. So maybe I'll just continue with that thread this week and not put any pressure on myself. Um, I have working examples for text to speech and speech to text, but I'm still trying to figure out if there's a way that I can simplify them and use p5.sound, um, better. So maybe I could work through that during the stream, but, um, hold on. I'm trying to come back to here. Ah, yeah. So, let's go. Let me get the code from last week, which, um, is maybe, not, is it this one? Is this what I wrote during the stream? Yeah, this is what I wrote. Who are you? And then we wait for it to load. It's ready. Okay, great.

Okay. What I wanted to figure out is, so, where, um, let me move this over a little bit, actually. And this is plenty of room, to come more this way for me, not covering the code. Great. Um, and I could actually even move a little bit this way. Um, okay. What I wanted to figure out, let me just, I'm going to leave this. Let me duplicate this. Is, where do I put the, is this actually the code? This is the code that I wrote last week. It must be, right? Um, I think I put temperature. Oh, first of all, why did the system prompt, I can't remember what I, the other problem with me doing this right now is I don't have the edited version. It would be really helpful to do any continuation once I have the edited version, which usually helps me realize the things that I missed, and then I can bring those up at the beginning of the next video. But anyway, let's go with this thought. Um, ah, uh, K Vikman, uh, has a great comment here: maybe mention the connection to the Boltzmann distribution, which is why temperature shows up. I would love to mention that, but that would be something I would need to sort of research a little bit more. I don't know if you want to say a little bit more about that in the chat or point me to a link that might be a good reference. Uh, that would be great, but it's a great, uh, suggestion.

Okay, I'm just trying to figure out where I put the temperature. So I think it's here, and let me, well, making it bigger is not helping. Also, when I load this model, boy, does my browser, this is, like, basically the most expensive, newest MacBook you could get, and still, I feel like it's actually often when I'm plugged into HDMI out and recording my screen that everything seems to die. But, um, okay, I am using the bigger model, 1.7 billion parameters. Okay. Um, I don't know why, I have a cough, no issue at all before the stream. Hold on a second. Let me mute. I'm not muted. Okay. Um, I'm trying to figure out where to put the temperature. So, nope, that's not it. Or I clicked too soon. Okay. What if I were to make the temperature two? See, I think this is exactly the same reply. Like, first of all, these are my questions for Xenova. Oh, and the chat, I'm sorry, I'm not scrolling with it. Um, do_sample true. Oh, wait, wait, wait, wait. Sorry, I need to put do_sample true where? Okay. Ah, do_sample. Ah

### [35:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=2100s) Segment 8 (35:00 - 40:00)

so this I need to do. There we go. Okay. So, max_new_tokens, do_sample, temperature. So, without do_sample, what's it doing? I mean, it has to sample, but can I, like, set a random seed? That's what I'm trying to figure out, to reproduce the exact same result. Ah, the softmax, I'm just going to, uh, follow up on this: the softmax gives the probability of finding a physical system at a given energy level. Got it. Um, so it's the same mathematical function as the Boltzmann, I'm just trying to get the spelling here, Boltzmann distribution. Let me see if I can find a nice reference, a specific state. Got it. So, um, yeah. Is softmax mentioned here? Yes, I see. It is related to the Boltzmann distribution. I don't know, it might be out of scope for what I'm doing, but it is a very nice, uh, pointed point there. Okay, but I'm going to remove the comment. So by default it's greedy, so no multinomial sampling takes place. Layperson language, please. I mean, I think I understand what that means. Uh, in principle, temperature has no upper limit. That's correct. And that's why I want to show the math for it. I think this is what we're going to do today. You know, I think it's okay for me to just slowly get to things. Like, this comes up in my class, and I have no video that I can point people to. Um, okay.

So, but now what I want to see is, if I change the temperature to two, which should be very high, I can't tell. I can't fact check this. Okay. So, let's try something. Uh, you are a frog who only ever says ribbit. Why will it not open? Do I have this open in another window? No, these are other things. Is it, like, open in some other window on some other computer? Let me just copy paste here. No, this is even a copy of it. Doesn't make any sense. Let me do this. Okay, now I can save it. Let me try running this. There we go. Okay, Ribbit. I am a young frog. Ribbit. Okay, so the system prompt is working. I wanted to double check that. Um, and then, um, what I wanted to sort of see is, you know, I'm going to lower the temperature. And I should obviously make this, like, a slider that I can control while I'm running this. So, there's no history here. So, we're seeing with 0.1 temperature, this is what I get. And with a temperature of two, I'm trying to make extreme differences. I mean, I don't know. Can we get a sense of, you know, demonstrating this effectively? Um, I wonder if I give it, like, kind of more of a task. Can we have it be wrong more often once we change the temperature? That's what I kind of want to demonstrate. Um, just do this. I mean, I could actually just take out the system prompt right now, like, just give it no system prompt. Oh, whoops. I forgot. Uh, okay. So, a temperature of, I guess, just, like, 0.5. Let's do it. Okay. Ah. Okay. If you set the temperature to 0.00001, let's try that. 0.00001. Okay. And let me put this just in here just as, like, a, okay. So, let me run this.
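For context, the code being edited on stream looks roughly like this, a sketch of the transformers.js text-generation pipeline as I understand it; the model id, option values, and output shape here are stand-ins, so check the transformers.js docs for the exact API of the version you're using:

```js
import { pipeline } from "@huggingface/transformers";

// Stand-in model id; the stream uses a SmolLM variant with ~1.7B parameters.
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-1.7B-Instruct"
);

const messages = [
  { role: "system", content: "You are a frog who only ever says ribbit." },
  { role: "user", content: "Who are you?" },
];

const output = await generator(messages, {
  max_new_tokens: 128,
  do_sample: true,  // without this, decoding is greedy and
  temperature: 2.0, // temperature has no effect
});

// Output shape may vary by version; for chat-style input the reply is
// typically the last message in generated_text.
console.log(output[0].generated_text.at(-1).content);
```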

### [40:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=2400s) Segment 9 (40:00 - 45:00)

What's going to be a good demonstration? The word strawberry contains no letters that are not there in the word letter. So, it doesn't exactly count as making a fruit name out of letters. Wow. Let's run this again. And then what? But this is, so this is not what I would expect with such a low temperature. This is maybe what I would expect with a high temperature, wouldn't you think? I mean, I could gloss over this, I'm just trying to work this out in terms of going through the math and actually explaining it. I mean, these definitely feel like different responses. Even lower. What if I made it zero? You can't set the temperature to zero, can you? Like, it really should be the same. Like, with the lowest possible temperature, the temperature basically being zero, I should get the same result every time, essentially. But I'm not seeing that. I'm skeptical that I've actually put the variable in correctly. Like, I cannot detect any difference between, you know, okay, I mean, this is a good point. I wouldn't call it an awful LLM model. I think this SmolLM is a terrific model, because it runs right there in the browser, and you can use it for all sorts of creative things. Uh, but, so maybe this isn't the best, um, yeah, right, Xenova, I agree. Um, maybe this isn't the best demonstration. And could it be clamped behind the scenes? Like, um, whatever, I don't know what layer of the library or the model, uh, could be actually not taking our raw value but, like, constraining it to within a range. Let's actually, okay, let's do do_sample false. Let's make sure this is right. Okay. So, this is good to see. This is a good thing for me to demonstrate, actually: sampling versus not sampling, and then changing the temperature, which, um, you know, like you're saying, I also could use a different, uh, model. Uh, I do want to cover, um, using Ollama as well. I could add this to my list. Um, I come back over here. Uh, I also want to look at, if I put it down here, Ollama. Okay.

I think the mowing is done. So, um, I think I can move towards, um, making a, um, yes. Uh, right. I don't know, this could be the wrong spot for the temperature parameter. I just sort of assume it's there. And we've got the creator of transformers.js in the chat, so I assume if I have it wrong, he would say something. Okay. So, let's make this the topic for today. And, um, you know, worst case scenario, I'm struggling with, like, worst case scenario, this particular topic that I want to cover doesn't necessarily have to be, um, edited down. It could just be a thing that I'm covering during the live stream, which I think is also, like, a path for the Coding Train at some point, which could be to more just focus on covering topics during live streams. But the thing is, like, I don't know. That's not what I'm trying today. All right. But, um, if there's some time left over, maybe just to dip my toe in, maybe I would try redoing one of these, uh, or doing another look at another p5 2.0 feature. Okay. So, I've got to get a few things set up.

### [45:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=2700s) Segment 10 (45:00 - 50:00)

The other thing is, I don't know where these videos, you know, should live. I guess maybe it's time for a new track. This could use some reorganization, a lot of this stuff here. Um, it's such a great system, um, to have for the channel, but I feel like it's ballooned a bit, and I haven't really been keeping it maintained. So, uh, I could really use help. That's the other thing I could do today: I could show people how the website works and actually make a track for the p5 2.0 videos, because I would love help from the community in maintaining the website. The website is a totally open-source system that you can very easily, through just, like, markdown and JSON files, uh, contribute to and help maintain, and I would love more help from the community in doing that. Okay. Um, is Gemma available with, um, transformers.js? I don't remember. It is in the right spot. Probably something, let me just put this on the screen here: probably something funky happening with the sampling at such low temperature, but for the most part, it should operate correctly with temperature 0.1. Yeah, let's just, um, it's the lighting, by the way. Uh, Shumi is making a comment: uh, who's this guy here? No, it's the lighting in here really shows my gray hair. I just couldn't see it with the lighting. No, this is literally, like, 10 years ago. This is what happens to a person. Uh, although I'm in much better shape than I was probably 10 years ago, because my exercise regimen is very strong these days. Uh, but, boy, do I feel old sometimes. Maybe I should retire from YouTubing. What do you think? Everybody, that was a thing for a while, and instead I just, like, burnt out, petered out. Now I'm sort of back, but not really back. I don't know. Okay.

Uh, let's try to accomplish something. It's already almost been an hour, and I don't think I'll go much past two o'clock. Um, not going to get very far. Okay. All right. Let me keep some things open. So, this is something I need to have open, because I'm going to refer to it. Uh, do not need this open. What I do need is, uh, I have this Markov chain video, and I don't know which one I want to refer to, but let's look at this. Let me just see what's going on here. Okay, hold on. Let me do some stuff here first. Oh boy. Let's update to p5 2.0. Why not? We don't need the DOM library anymore. Uh, let's go here. Okay. And then, how did I do this? I think I did this. How do you make something, everything? No, not star. There we go. But that was crazy. Uh, turn the auto refresh on. Okay, great. Okay. Uh, oh no. And this is using an order of two. I think this actually is not a good example for me to use. Wait, what's it loading? Oh, this is, I didn't realize how long that was. Where did this come from? I don't remember what text I used for this video. Let's go to the other one. Uh, name generator. Uh, right. Oh, this was me trying to rename the channel. Okay, this should work. Um, I can't update this to p5 2.0 because that will cause loading problems. Oh, but you know what? Let me duplicate it. Part two. You'll see why in a second. I'm going to do this. Um, order three. Uh, await. No, async.

### [50:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=3000s) Segment 11 (50:00 - 55:00)

Okay. Make sure this still works. No. Oh. Oh, wait. There we go. Oh, now, sorry, to p5 2.0. There we go. Uh, let's change this to, so, there we go. Great. Okay. Oh, I forgot I have this comment on the screen still. Um, okay. I think we are now good. However, now let me go to, uh, okay, so I have, this example is what I'm essentially going to build. Um, but I also need this video. Uh, okay. Is this the one? Oh, yeah. Okay. So actually, then I also need, sorry, I'm trying to figure out where I have this. Okay, in, uh, here: weighted selection. There we go. Okay. So here's my weighted selection function. I don't think I'm going to explain this again, because I have it in Nature of Code, and I also have this video where I cover that algorithm. So, okay, references are: this is what I'm going to be programming. So, let me stick that over here. Well, I'll show you what this is. This is, like, a whole, uh, an object with key-value pairs. So, fruits with a score, and then I'm picking them, sampling them here according to their score. And then I can adjust the temperature. This is essentially what I want to, uh, build, and then apply it to Markov chains. And in order to apply it to the Markov chain, which is this code, okay, hold on, I need this code. This code, Markov chain. Okay, so this I don't need. This is maybe a reference, but, um, this I'm not going to look at today. This I don't need. Okay.

So, the main thing that I need to do here, all right, so one thing I need to do here, which I'll change, yeah, just look at this for a second. If we look at the Markov chain, it's showing me every three-letter sequence, and let's look at, like, this one, for example, and everything that comes after it. So what I need to do, though, to be able to do weighted sampling is, with what I do in the current Markov chain, I just pick randomly one of them, and it's more likely if it's in there a bunch of times. But what I actually need to do is count how many times each appears, because those are the scores, so that I can apply softmax and then, uh, temperature. Okay, let's see if anybody else has said anything. So I think I'm ready to go now. I have all the pieces of what I need. I don't understand where this video is going to live on the Coding Train website or in what playlist, but I guess it's just a standalone video now. And, um, let me go over to the whiteboard. I want to absolutely keep this list, which is, why am I not signed in? Okay, that's weird. Sorry, everybody. For some reason I'm not signed in. And in order to sign in, I probably need to do stuff that I don't want you to see. So I'm just going to, um, uh, you can enjoy watching this sketch run for a minute while I sign in here to the Vibe board.
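The weighted selection function being referenced looks roughly like this, a sketch of the algorithm from the Nature of Code videos with illustrative names, assuming the weights have already been normalized (say, by softmax) to sum to 1:

```js
// Subtract each probability from a random number in [0, 1); the index
// where it crosses zero is the pick, so higher-probability items win
// proportionally more often.
function weightedSelection(items, probabilities) {
  let index = 0;
  let r = Math.random();
  do {
    r -= probabilities[index];
    index++;
  } while (r > 0 && index < items.length);
  return items[index - 1];
}

const fruits = ["apple", "banana", "cherry"];
const probs = [0.8, 0.15, 0.05];
console.log(weightedSelection(fruits, probs)); // "apple" about 80% of the time
```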

### [55:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=3300s) Segment 12 (55:00 - 60:00)

Wait. Okay. Sign in. Okay. Oh, quick sign in with my, hold on. Apparently I can QR code sign in. So, let me try that. Um, I am signing in. Just so you know, I haven't completely left you all. Oh my goodness. No, I don't want to install an app. Oh jeez, this is so weird. I think the board got reset, because, I don't know if people remember last week, um, sign in, uh, okay, this is working. I think last week I had an issue where the board was going to sleep. Your sign-in request is not permitted. This is so weird. What is going on? Okay, let me sign in using my, I mean, I should just deal with this later. I can figure out how to save stuff later, but let me just give this one more shot. Yep. Okay, this is working. Uh, but you're just going to have to wait a minute here. Terrible. Terrible live streaming. Okay. Oh my god. Help. Okay. All right. Hold on. This is going to just take a minute here. All right, let's look at this. Almost there, everybody. Okay. Just need a two-factor. Okay, I believe I am now logged in to this board. But where's my thing that I did? Okay. Well, I'm logged back into the board. Uh, but, um, okay. I don't know why I lost my canvas from before. Uh, so I guess I can log back out and find it. It was such a nice list of all the things. Of course, it's captured forever in the recording of this, uh, video, but okay. Um, all right. So, I'm now going to attempt to cover this topic, and I have to think about how, like, oh, I did all this time deciding what I'm going to do, and now I have to think about how I'm going to do it. And I also have to remember that the point of what I'm doing here is not to make some sort of perfect, edited, scripted, succinct, concise, fully thought out and perfectly explained, uh, video about everything you ever needed to know about softmax and sampling, but to just cover it here, with all of you, in this moment, in this live stream together, and assume that a slightly edited, shorter version of that is a useful thing, and if it isn't, that's also fine. Always point people to the

### [1:00:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=3600s) Segment 13 (60:00 - 65:00)

time code about now in the video. I can also, for Matia, press this add marker button, which, um, will just save a little text file with time codes from the stream of every time I do that. Okay. I kind of want to make a little outline for myself, though, of what I'm going to talk about. So, first, I'm going to remind people about the Markov chain and its, uh, technique for sampling. Then I'm going to remind people about the Nature of Code genetic algorithm and the weighted sampling there. Then I'm going to connect this to modern LLMs, and also, like, other ML, like image classification, so that we can then learn about softmax and then, uh, temperature, and we will finish with some kind of example that does something like this, basically. Okay, how's everybody feel about that as my plan? And then we'll see where we are. Hopefully this won't be more than, I mean, in theory, the idea is for this to be, like, a 10 to 15 minute video at the most. I mean, this could be, like, a five minute video just to explain it, but that's just not what I do. But I'm thinking this might take me an hour to sort of muddle my way through. Okay, I'm just trying to decide if I want to start at the whiteboard or start here, and what to show first, exactly. All right.

Oh, this also is an excellent post from, oh, wow, it's from 2020. March 1, 2020. What an auspicious date. Uh, but this also, greedy, right? Highest probability. Oh, it's greedy without the do_sample. Ah, I see. I see. Okay. Wow. I should really read this post before I make this video. Just looking through it to see if there's any other good references here. Sampling, yes, probabilistic. do_sample true, deactivate top-k sampling. Oh yeah, I should cover also what top-k and top-p are. So, uh, top-k. Oh, look, we'll just get an AI overview that, uh, restricts the token selection to the k most probable next tokens. So, without do_sample, it's the equivalent of top-k is one, right? You're just picking the top token. Uh, okay. Where are we going to get a nice top-p explanation here? Top-p, this is, like, you could change top-p instead of temperature, from what I understand. Diversity of generated text by setting a cumulative probability threshold. Oh, so it's like top-k, but the threshold, right, low top-p and high temperature create a mix of common and slightly less common words, balancing predictability. I don't really understand that. Okay, uh, top-p one, consider all tokens. Top-p zero, the single, ah, greedy decoding. So a top-p of zero is greedy, and a top-k of one is also greedy. So top-p and top-k are

### [1:05:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=3900s) Segment 14 (65:00 - 70:00)

the same kind of idea. It's just, top-k is, like, actually counting how many options we will pick, and top-p is, what's the probability threshold for which we will consider it an option? Um, and it's kind of, like, inverted, like, zero is only, like, the top, the single most probable token. Oh, you're taking, it's not actually a threshold. You're taking 90%. Um, oh, thank you. Don't worry, I'm going to have that list, no problem. And I'm sure if I sign back out of the account I signed into, it would be there, and I could get it again. So, I'll figure that out. Okay. It is generally recommended to adjust either top-p or temperature, but not both. Right. Uh, that makes sense. Okay. Decrease top-p for, well, yeah, interesting. Okay. Um, okay. Possible next tokens: involves sorting these tokens by probability in descending order and summing their probabilities until a cumulative total is reached. Ah, it then considers this the nucleus, or subset. Got it. Still don't understand exactly, um, still don't understand exactly what it means. So you sum up all the probabilities, and if those probabilities, um, so why would zero give you only the most, so, and one, oh, I see. You start with the top one and add it to the sum. Once it's above the threshold, you stop. So, you would only get to one if you add them all together. Got it. That's actually how my sampling algorithm works. Okay. I think I understand these things.

I kind of hate, I sort of hate that this, uh, reference for the video has this, oh, I know how I can do this, though. If I want this in the background, um, I can just have it, like, paused. Okay, that's fine. So, I can reference the video without having that crazy looking thumbnail. Okay. Oh, and actually, this is kind of the one that I'm using the example from. Okay. I made some notes about softmax. I wrote out all these notes in a notebook, and I didn't bring that notebook with me here, did I? No. I'm tempted to take a five minute break and go get that notebook, where I wrote out all the formulas last week when I was teaching this. The lawn mower is still going. Let's take a five minute break, everybody. I know this is really annoying for those of you watching, but I'm going to run and grab that notebook, um, so that I have those notes with the different formulas in them, um, if that's okay with you all. So I'm going to put this, oh, sorry. I was muted. How long was I muted? Um, sorry. I'm going to take a five minute break. Music's running. I'll be right back. I'm going to grab that notebook. Okay.
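Putting that reading into code, a sketch of top-k and top-p over a small probability table (the data and names are made up for illustration; after filtering you would renormalize what's left and sample from it):

```js
// Entries are [token, probability] pairs, assumed to sum to 1.
function topK(entries, k) {
  const sorted = [...entries].sort((a, b) => b[1] - a[1]);
  return sorted.slice(0, k); // keep only the k most probable tokens
}

function topP(entries, p) {
  const sorted = [...entries].sort((a, b) => b[1] - a[1]);
  const nucleus = [];
  let cumulative = 0;
  for (const [token, prob] of sorted) {
    nucleus.push([token, prob]);
    cumulative += prob;
    if (cumulative >= p) break; // stop once the running total crosses p
  }
  return nucleus;
}

const table = [["world", 0.8], ["how", 0.15], ["Dan", 0.05]];
console.log(topK(table, 1));   // [["world", 0.8]] — greedy
console.log(topP(table, 0.9)); // [["world", 0.8], ["how", 0.15]]
```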

### [1:10:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=4200s) Segment 15 (70:00 - 75:00)

Okay, I am back, and I brought my other notebooks, but I don't know what page or where. So, hold on. Give me a second. I think it's in these. It wouldn't be in this one. I don't think I was using this notebook last week. Okay, there's also this one. This is so confusing. Too many notebooks, right? I think this one I determined was no good. It should be in here somewhere. Okay. Ah, yes. Here we go. Yeah. Okay. Actually, I barely put anything in this notebook, but I went all the way back to, uh, I guess you can't, the lighting in here is, to get this. Okay. But it's fine. It's a nice reference for me to have, uh, there. Okay. Look at me procrastinating, procrastinating, but let's do this. This is going to be great. Okay, I'm just going to get these two pieces of paper. Okay, I'm going to end up looking up the Wikipedia page for these formulas anyway. All right. So, um, the other thing I just want to mention to everyone, because it's a little bit confusing if you haven't watched one of my live streams before, is I'm about to go into I'm-recording-a-video mode. So, I'm going to, like, pretend as if nothing we talked about for the last hour and 15 minutes happened. Um, so, and Chris Flannry, who says, "I just got here," well, I think your timing is perfect. Really, he might be interested in all of the chitchat and other stuff that I talked about for the last hour, but I'm going to get into today's coding topic and start to work on it now. All right. By the way, I took that five minute break and I lost, like, 30 viewers, which is fine. Let's gain them back. Hey everybody, tell everybody around the

### [1:15:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=4500s) Segment 16 (75:00 - 80:00)

world that Dan Shiffman, live on YouTube, is going to sort of, kind of, mostly explain what softmax is. I think the entire world is going to want to tune in for this. The numbers are going to go through the roof. Okay. Um, all right. I think I might actually, well, I'm going to start here, and then I might move very quickly over to the whiteboard. Um, because I have an idea of what I, let me just, I know I'm, like, refusing to start, but just give me a second here. Uh, open. Oops. How do I get to all the ones I had last time? No. Uh, how do I get to, what's this do? Uh, no. Oh, I figured out some weird stuff, that I can actually open up, like, p5 sketches here and, like, annotate them. Um, but how do I get to all of my other, yeah, here we go. I was just sort of, um, trying to see if any of the things, like, this is what I was talking about last time. Um, oh, no. It's, like, this looks similar, but that was, okay. So, all right. Okay. All right. I've just got to do it. Will I bring back coding challenges? Yes. Maybe that's just what I should, I don't know. Yes, I would like to. Okay.

Hi everyone. In today's video, I am going to be covering a topic. See, this is my problem. I refuse to just allow myself to do this. What I really should do is just record an intro later, after, but it's okay. I'm adding a marker for Matia. This, oh, and I'm even in the wrong, I'm in the wrong view. Okay, the lawn mower. Let me just see where it is. It's okay. All right, here we go. 120. We're going to talk about this for the next 40 minutes. My viewers came back. I know, the wrong camera. It's okay, everyone. I'm going to figure out the wrong camera in, like, 10 seconds, uh, in your time. I just cannot believe the comedy of errors that is the lawn mowing and my video recording, which happens very infrequently, like one day a week, and only within a, like, two hour period on that day. And yet somehow, no matter what I do, those two things happen at the same time. Okay, but we're going to be fine.

Hi everyone. Today I am covering a topic that connects to a bunch of different themes and things that I've been covering on this channel for a number of years. Um, and, it got, see, you can't hear it. It just, like, got really loud there for a second. Okay. Luke says he can't hear the mower. See? And then I say, like, um, and I pause, and I get worried, and, no, no. All right, everybody. I don't like the origin of this, but it is very appropriate right now. We're doing it live, people. We're doing it live. I scared myself with how aggressive that got. You know what I need to do? This is what we need to do, everybody. This is going to make it okay. It's time to read some random numbers. Last time I did this, I don't know when this was. Do we have some, like, relaxing music that will work with this?

### [1:20:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=4800s) Segment 17 (80:00 - 85:00)

Oh, that's a nice little sound effect. 33 07. Oh my god, it's so loud. We're just gonna have to wait till this guy finishes. 65578. Let's see how many viewers I lose when I read, just, live streaming, reading random numbers. I think it's better if the book is much more visible. 428672048 6244 57592 26354 38251 43065 61222 4841 6341736 0635 7352 87868 95894. Some people do deep breathing. Other people read from a random number book. Like, you know, you're all addicted to TikTok, I can only assume. You should get yourself one of these books and just read it quietly to yourself. I like how, I'm, like, watching the viewer number as I'm doing this, and it's going up. 96734 20733 4957 544 409488 67496 77673. Okay, I hear the dulcet sounds of the mowing off in the distance. Um, Corey says, "Looking for the audiobook." I have been threatening to make an audiobook for this. It would be very, very long. We could also automate it. That would actually be a really good ElevenLabs project, uh, with my voice clone, um, that I could do. Okay. All right. Um, I think I can get going now.

Hi everyone. Uh, today I'm looking at something called softmax, temperature, weighted sampling, top-p, top-k. All of these terms that have to do with, well, in particular, they're probably most relevant these days, or where you hear them the most, is: how does a large language model pick the next token, pick whatever it's going to say? How does that process work? There's some kind of, like, statistical model of language with probabilities. I know, I'm just, like, I don't know what I'm doing here. Wait, let's press the marker. I didn't start. If I press the marker, then Matia won't know that I did this a few times, as I'll tell him, just go to the last marker. And I meant to be in this view, uh, because this is what is best for recording. All right, this is confusing, because I don't understand, like, the context of this video, of what I'm talking about. But here's my commitment: I could always go back and record a "hello, this is what this video is" afterwards. So, let me just get started.

Hi everyone. Uh, today I'm covering a topic that interconnects to a whole bunch of things I've been doing on my channel. But the reason why, and I'm looking for my marker here, the reason why I want to cover it in this video is because I'm starting to work with language models. I recently made a video that shows you how to load a language model called SmolLM. This model is compatible with JavaScript and runs locally, right in the browser.

### [1:25:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=5100s) Segment 18 (85:00 - 90:00)

And we can do things like prompt the model. Let's say I were to prompt the model with just one word. This is the input to the model. Now, what you might expect to come afterwards, maybe, is the word hello, or Dan, or how, maybe it's going to say hello, how are you? I don't know. It's going to complete your initial sequence of text with other text, based on some set of probabilities. That's not what I wanted to say. I'm just doing a bunch of different things here to get myself into this. It's fine. And I pressed the stupid, like, button again, which creates new pages. There we go. Okay. So, usually, if you're working with a language model, you're probably prompting it with a sentence, a paragraph, a lot of stuff. But just to simplify things, imagine if you're asking it to fill in a blank. Hello, blank. What might come out? Well, maybe what might come out is the word, oops, world. Maybe what might come out is the word how. I wonder if a better example would make more sense. Like, hello, how are, I don't know. See, this is where I get in my head here. I, like, should, like, plan these videos more. I don't know if anybody has an idea for, is this scenario, how's this scenario so far? Okay, there's very little going on in the chat that I just checked. It's all right. I'm gonna keep going with this. Maybe the model somehow knows who I am. Maybe it's going to say Dan.

And if you've ever worked with a language model before, you might notice that even with the same prompt, you don't get the exact same output every single time. That's because what a language model is actually doing is predicting the next token, or, an oversimplified way of thinking of it, the next word. It's selecting from a map of probabilities, a table of probabilities. Maybe world is picked 80% of the time, how is picked 15% of the time, Dan is picked 5% of the time. That's what the neural network architecture is doing. It's encoding the input text, and maybe I'll come back and do a deeper dive into some of these pieces in another video, processing it through the neural network, and the output of the neural network is a giant table of all the possible next tokens in its vocabulary, along with probabilities. So this begs the question: how are these probabilities calculated? And then, why would you sometimes even want to adjust these probabilities? Maybe you want to flatten them, to make the lower probability words higher probability. And, we're just editing this video a lot today, it's fine. Maybe you want to flatten them so that the lower probability words, again, I'm just conflating the concept of a token and a word. I don't like how I'm just glossing over that. Although, this is not a full, like, LLM explainer. I think that's fine.
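To make that concrete with the made-up numbers above: applying a temperature directly to an existing probability table amounts to raising each probability to the power 1/T and renormalizing (equivalent to taking softmax of the log-probabilities divided by T). A minimal sketch:

```js
// T > 1 flattens the distribution; T < 1 sharpens it.
function applyTemperature(probs, T) {
  const scaled = probs.map((p) => Math.pow(p, 1 / T));
  const total = scaled.reduce((sum, p) => sum + p, 0);
  return scaled.map((p) => p / total);
}

const probs = [0.8, 0.15, 0.05]; // world, how, Dan
console.log(applyTemperature(probs, 2.0)); // ≈ [0.59, 0.26, 0.15] — flatter, Dan more likely
console.log(applyTemperature(probs, 0.5)); // ≈ [0.96, 0.03, 0.00] — sharper, world dominates
```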

### [1:30:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=5400s) Segment 19 (90:00 - 95:00)

Maybe the lower, maybe you want the, I mean, I could always record a correction for this later, too. Maybe the lower probability words you want to be even more likely. I forgot where I was there. Maybe you want, ah, like magic. This is not the most important point. Maybe you want to flatten the probabilities so that the lower probability words, like Dan, actually have a higher chance of being picked, and the higher probability ones have a lower one. Or maybe you want to stretch the probabilities so that world is even more likely than 80%. This is what I'm going to talk about. This is a concept known as temperature. Oh jeez. So these are the things I want to explain in this video: softmax, being the mathematical operation that takes the output of a neural network and turns it into probabilities, and temperature, being a property that can be applied mathematically to those probabilities to adjust them in one direction or the other. However, it's important to note that these concepts of softmax and temperature are not the exclusive domain of large language models. And that's where I want to travel back in time, to many, many years ago, an old coding challenge on the Coding Train: Markov chains.

Okay, we got through that somehow. Matia, this, like, warms my heart to put it up on the screen, to no end. And I'll tell you something. Um, uh, my kids are, like, older than yours, but not by much, but, like, you know, um, yes, I'm having the same experience, except instead of a baby to, like, an 11-year-old, it's, like, remember, they were, like, six, and now they're 16. It's crazy. Um, yeah. Okay. And, um, all right. So let's keep moving here. Okay, so, I'm not going to rehash the entire coding challenge of the Markov chain, but let me give you a quick high-level overview of what a Markov chain is, and let's look at the code example for it. I don't know why the timing of that was unfortunate. I'm not going to rehash the entire Markov chain coding challenge. You can go and watch that for yourself if you're interested. But let me give you a quick high-level overview of what's happening in it, and talk about what I'm going to change to add the concepts of softmax and temperature to it. Okay. That's such a good reference, but, all right

### [1:35:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=5700s) Segment 20 (95:00 - 100:00)

this wonderful post by Victor Powell does, I think, the best job of explaining the core concept of a Markov chain. We have a system that has a number of possible states, maybe just two states, A and B, and it hops from one state to another based on probabilities. So in this case, A has a 50% chance of staying at A or switching to B, and B is the same. Now I've changed the probabilities so it's 90% likely to stay in A, and you can see how the behavior of the system changes. (There's also a Veritasium video on this.) Markov chains can be used to study financial systems, weather systems, all sorts of systems where there's sequential data or sequential events, and where you want to predict the future or analyze the sequence based on historical data. A Markov chain can also be a language model. A word, or a token, can be considered a state. The current state of the system is the word "hello." What is the next word? Well, that's the same question as asking: what is the next state? So in a Markov chain, we might store a list of all the words in our vocabulary, and then, for every word, a table of probabilities for what could be the next word. And honestly, a large language model is, at its core, doing the same thing, even though it's very different in lots of ways: what is the current state, the input text, and what comes next? The difference is that with a Markov chain, we're actually storing a giant table of all the probabilities, while a large language model is using a neural network to essentially estimate all those probabilities. You could not possibly load the amount of data you might train a large language model on into a Markov chain. But I digress; I do think this is an important thing to think about. One of the beautiful things about using a Markov model for text is that if you're asking, "why did it say what it said?", we can literally open the hood of the Markov chain and look at all the probabilities. When you ask why a large language model said what it said, well, we don't really know. We have the weights of the neural network, and do we really know what data it was trained on? There's a much bigger loss of the connection between the predicted output text, the input text, and the data that was actually used to drive those probabilities. Let me talk through that again, because I'm working out in my own mind what this video is about. (Buildables gives a very good piece of feedback in the chat, which is that "hello" is kind of a bad example, but it's sort of too late; this video is what it is.)
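As a sketch of that two-state system, here's the 90/10 version from the demo in plain JavaScript (the transition numbers are the ones in the example):

```javascript
// A minimal two-state Markov chain, as in Victor Powell's visualization.
const transitions = {
  A: { A: 0.9, B: 0.1 }, // 90% chance of staying in A
  B: { A: 0.5, B: 0.5 }, // 50/50 from B
};

function nextState(state) {
  // With only two states, one comparison against P(next = A) is enough.
  return Math.random() < transitions[state].A ? 'A' : 'B';
}

let state = 'A';
let sequence = '';
for (let i = 0; i < 20; i++) {
  sequence += state;
  state = nextState(state);
}
console.log(sequence); // e.g. "AAAAAAABAAAABBAAAAAA"
```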

### [1:40:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=6000s) Segment 21 (100:00 - 105:00)

A Markov chain is a language model. We can think of the current word, the input word "hello," as the current state of the chain. (I walked over here with an idea of what I was going to say, but now I'm in my head about the word "hello" versus the word "I." I'm just going to keep going.) I could actually replace "LLM" here with "Markov model." Why? Because we can think of the input text to a Markov model as the current state: I have the word "hello," that's the state of my system. The Markov model could then be holding the entire vocabulary as a table, with probabilities for every word that might follow the word "hello." This is one of the things I love about Markov models. If I have a Markov model generating text for me and I'm wondering why it's saying what it's saying, I can open it up and look at the giant table of probabilities. This is very different from a large language model, a neural-network-based system that's just estimating probabilities: it's been trained with massive amounts of data, and by the time you're using it, that data is so far away that you can't make those connections anymore. That's actually all I wanted to say; much better. However, a Markov model that's modeling text typically doesn't do it one word at a time. Back to here: this is my Markov chain example, which was made so long ago that it was before I even called this channel the Coding Train, so I think it's trying to name the channel. Let's see what we get: "Code red," "coding flock," "Go the sky," "The code kitchen." I like that. I'm rebranding the channel, everybody: The Code Kitchen. So, one important distinction is that a Markov model for text isn't typically looking at one word at a time. It looks at this concept known as an n-gram. Let me find that in the references. Oh, this is actually great: there's a visualization of this by Chris Harrison that does a better job than mine over there, so let me go back and forth between these.

### [1:45:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=6300s) Segment 22 (105:00 - 110:00)

It's Google's trigram data, yeah. This is a wonderful visualization of this idea by Chris Harrison. It's using text from a 2006 Google dataset (I thought it was from Google Books at first), visualizing the frequency of words that come after the word "I," for example: "have" being the most likely, "am" second, and so on, and then further words after those. So I think that will be nice to include. I should also mention that a Markov chain for text doesn't typically consider just one word as the state. It usually looks at two words, or three words, or four words, and tries to use that to predict what the next word would be. And in fact, in my example, I didn't even use words at all. This is a character-level Markov chain, meaning it's looking at sequences of three or four characters in a row and then creating a table of probabilities for which character would come next. These sequences, whether of two or five words, or of three or four characters, are known as n-grams, and the term order is usually applied to the value of n: are we looking at groups of three, or of four? That's the order. So in this code, you can see there's a variable called order, and then there's this object called ngrams, which I'm logging in the console here. I should say this demonstration loads a text file with very little text in it, all of these made-up names for what could be a YouTube channel about coding, and then it counts. Wait, how is it getting 48 instances of "cod"? Oh, this file is much longer than I realized; okay. So, with the order being three, I have every single sequence of three characters, including spaces, all listed here. And for every one, like "g c" (g, space, c), I have an array of all the letters that come after it. So look at this: "o" appears five times, so I store "o" five times. I need a better example; "rainbow" is in there so many times. Yeah.
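Here's a minimal sketch of how a table like that gets built (order 3; the lines array is a tiny stand-in for the loaded text file of channel names):

```javascript
// Building a character-level n-gram table, modeled on the Markov chain
// coding challenge.
const lines = ['coding comrades', 'code red', 'coding colors'];
const order = 3; // length of each n-gram
const ngrams = {}; // n-gram string -> array of the characters that follow it

for (const txt of lines) {
  for (let i = 0; i < txt.length - order; i++) {
    const gram = txt.substring(i, i + order);
    const next = txt.charAt(i + order);
    if (!ngrams[gram]) ngrams[gram] = [];
    ngrams[gram].push(next); // duplicates encode frequency
  }
}

console.log(ngrams['cod']); // e.g. ['i', 'e', 'i']
console.log(ngrams[' co']); // e.g. ['m', 'l'] from "comrades" and "colors"
```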

### [1:50:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=6600s) Segment 23 (110:00 - 115:00)

Well, let's look at "space-c-o": it has D, M, D, D... The D is there because there are so many instances of the word "code" or "coding" coming after "space-c-o." The M? Yes, that's from "coding comrades." And there's even an L, from "coding colors." Okay, so that's what I wanted to show there. The Markov chain then generates text by looking at what the current state is and then sampling from all the options of what could come next. And what my code does is: if the state is "space-c-o," it just picks randomly from this big array, so D is more likely to be picked because it's in the array many more times. But this is a very crude and simplistic methodology, and it doesn't allow for the concepts of softmax and temperature to adjust the behavior of the Markov chain itself. (And I assume we'll edit out me wiping my nose.) See, I didn't intend to re-explain the whole Markov chain thing, but it sort of feels necessary to get into this, and I don't think everyone will have watched the whole coding challenge. So hopefully that's the TL;DR of it; I go build the whole thing in that video if you want to go watch it. Okay: at the end of this video, I'm going to come back and update my Markov chain code to use softmax and temperature. But let's first create a kind of dummy scenario where we can work out the math for these two algorithms. This is where I need some help; I'll just make this up. I did a fruit inventory thing once, but I don't think that makes sense here. I would love help from the chat. I could call this "options" and make it an array that has a name and a score. Or, we could do... yeah, I

### [1:55:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=6900s) Segment 24 (115:00 - 120:00)

could actually call these labels, because then I can connect this to image classification, and I can do, like, name: "cat", score: 2.3. (Oh, and the score should not have quotes around it.) I don't love this data structure, but for lack of a better idea, this is what I'm thinking. Okay: another place where you might have encountered the softmax function is if you've worked with machine learning image classifiers. I've most likely alluded to or talked about softmax in my image classification videos; I have several of those about working with ml5 and TensorFlow.js, as well as the Teachable Machine system. An image classifier will output a list of labels and probability scores for those labels: 90% cat, 9% dog, 1% turtle. But what the neural network actually outputs is not those probability scores. The output of the neural network is often referred to as the raw logits. This is a post about Teachable Machine, which I've covered extensively on my channel, and in this post, down here...

### [2:00:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=7200s) Segment 25 (120:00 - 125:00)

If this image of a cat is passed into the image classifier, you get this giant spike in the visualization of all the probabilities, all around different categories of cat. But the raw output of the neural network looks like this: the cat logits, unnormalized scores of some value between about -10 and 10 across all 1,000 labels that are part of this classification system. So, how does it work to go from logits to probabilities? (This is going to be such a long video, which was not my intention. I've lost so many viewers. It's fine; I'm working through this today, and I'm gaining knowledge just trying to explain it. But boy, did I just take twenty minutes to get to the softmax part. I think it's going to move faster now.) Okay. So, I have a kind of crude p5.js sketch with a simulated output of a neural network: an array of labels that each has a name and a score. Let's change this turtle score to 0.23. Now, one thing that I could do is write a quick algorithm to normalize all those scores: add up all the scores into a sum, divide every score by that sum, and I have normalized probabilities: 67%, 26%, and about 6%. So this is one way to take any arbitrary scores and normalize them, and certainly an option I've used elsewhere. It's not softmax, though. Why is this not good enough? Well, one reason is that neural networks will output negative numbers. If I run this code again with a negative score, I get nonsense in the probabilities; it doesn't work. And then the other note I made is... oops.
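Here's a minimal sketch of that naive normalization with the simulated scores, showing how a negative logit breaks it (plain JavaScript; the numbers are from the example):

```javascript
// Plain sum-normalization (not softmax). Watch what happens when one
// logit is negative, as a real network might output.
let labels = [
  { name: 'cat', score: 2.3 },
  { name: 'dog', score: 0.9 },
  { name: 'turtle', score: -3 }, // a negative logit
];

let sum = 0;
for (const label of labels) sum += label.score; // sum = 0.2

for (const label of labels) {
  label.probability = label.score / sum;
  console.log(label.name, label.probability);
}
// cat 11.5, dog 4.5, turtle -15: "probabilities" far outside [0, 1]. Nonsense!
```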

### [2:05:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=7500s) Segment 26 (125:00 - 130:00)

Wait, hold on, I just have to point this out: even if this stream never gets edited into a video, we have collectively fixed a bug in transformers.js. So I can update to 3.7.4 when I demonstrate that. I was actually trying to show temperature in class last week and couldn't make it work; that's great. (K Vikman, just to give you a little more context: Xenova is, yes, a developer on transformers.js, but he's actually the person who created the library, and essentially the main contributor. I don't know how many other people are contributing to it, but it's mostly all him.) Okay. So here's the other reason for softmax, and we'll get to the math in a second. Just looking at the definition from Wikipedia: it is an exponential function. So it allows for more possibilities in how you might adjust the relative scale of those probabilities. And you can see here that everything I've said in this very, very long video is summarized in this one sentence: "The softmax function is often used as the last activation function of a neural network to normalize the output to a probability distribution over predicted output classes." Let's now put the math into our code. What? First I have to find the math... okay, here it is. Yikes. Let's put this on the whiteboard and explain it. E to the z_i; I just need to get the notation right. Such a big eraser. Okay, now don't be alarmed: we're going to get through this together, and it's actually very simple. So, first of all, this lowercase sigma, that's what that is, right? I've got to look up my Greek letters, because

### [2:10:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=7800s) Segment 27 (130:00 - 135:00)

isn't sigma usually for the sigmoid function? So what's going on here? Why is there a sigma? Is that just the label the Wikipedia page is using? I need some help with this... yes, okay: the Wikipedia page is using the lowercase Greek letter sigma to denote softmax. Greek letter usage is overloaded. I think it's a little clearer for me to just write "softmax," so I'm going to rewrite this as softmax of z. So, what is z? z is my array of raw score values, the logits. Where's my code with these values? Let me put them in: 2.3, 0.9, -3. All right: for every i, the softmax of z_i equals e to the z_i (and I'll talk about what e is in a second), so for i equals zero, that's e to the 2.3, divided by the sum over j of e to the z_j. The notation sums from j equals 1, and I know in an array we'd count from zero, but essentially we need to go through all of these values and sum them together. So we're still just doing a sum, dressed up in an elaborate-looking formula.
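Written out cleanly, the formula on the whiteboard is (with indices running 1 to K as on the Wikipedia page, and the example logits plugged in):

```latex
% Softmax of a logit vector z = (z_1, ..., z_K)
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}},
\qquad
\text{e.g.}\quad
\mathrm{softmax}(z)_1 = \frac{e^{2.3}}{e^{2.3} + e^{0.9} + e^{-3}} \approx 0.80
```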

### [2:15:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=8100s) Segment 28 (135:00 - 140:00)

And if you missed it, this capital sigma, the Greek letter sigma, is the standard symbol for summation, from 1 to K, or, in the case of an array, over indices 0, 1, and 2. We're still just taking a sum of all the values and dividing each individual value by that sum, to normalize across zero to 100%. The difference is that instead of summing the raw values, we're using the raw values as the exponents: e to the 2.3 power, plus e to the 0.9, plus e to the -3. Oh, shoot. I wasn't pressing the record button. When did I stop recording? Okay, checking the chat: the last thing you saw was when I changed the sigma symbol. So I'm going to repeat the whole thing, because this is the heart of it, and who knows; you all still haven't caught up to me in real time anyway. I'm also going to need some help explaining why we use e, as opposed to any other number; if anybody wants to offer a short explanation or a reference in the chat, please do. Christian is saying that when you introduce temperature, it's like changing e; that's kind of where I'm going with this. But why do we start with e? It's Euler's constant, the base of the natural logarithm; it's sort of a mathematical standard, and I'll kind of gloss over it for now. Okay: z is the array of raw score values, the logits, coming out of a neural network. I think these match the numbers in my code, but I'm not 100% sure.

### [2:20:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=8400s) Segment 29 (140:00 - 145:00)

Hold on, let me do that again, because I wrote the numbers so big I don't have the right amount of room. So z is an array with one, two, three values. If we start with index zero, so i is 0, we've got the value 2.3. (And I realize the standard way of notating this would be from one to K, one to three, but in coding we count from zero; hopefully that's not too confusing.) So the softmax of 2.3 equals e (let me say what e is in a minute, just hold that thought) to the 2.3 power, divided by... okay, what does this mean? This is capital sigma, the Greek letter for summation. We're going to sum from j equals 1 all the way through K, there being K possibilities here, over every single z value: e to the 2.3, plus e to the 0.9, plus e to the -3 power. And we're going to do this for every single one of these. (Let me write this a little nicer. Oh jeez, this board.) So in the end, this is not that different from the standard normalization: you take the value and divide it by the sum of all the values. The difference is that each value is the exponent in e to that value. (Someone in the chat points out that you should also subtract the max value from the logits before taking the exponent, since the exponent can be numerically unstable. Yeah, that's a good point.) So, what is e? In truth, I could have put almost any number there instead of e. e is Euler's number. It's actually the base of the natural logarithm. Maybe that's just mathematical word salad to you; I'll include some references in the video's description, but essentially you can think of it as a mathematical standard. What we're eventually going to do with temperature is adjust the value of e. But for now, it's a starting point; the point is we need something as the base of the exponent. Euler's number is around 2.718, and it is, like pi, an irrational number where the digits never repeat with any

### [2:25:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=8700s) Segment 30 (145:00 - 150:00)

kind of pattern. This number is the base of the natural logarithm, so it's kind of a standard in mathematics. There's lots more to say about it; I'll include more references to Euler's number in the video's description. The point is that using these values as the exponent has the wonderful quality of allowing for two things. One is negative numbers: e to the -3 power is the same as 1 over e to the 3 power, so with a negative number we get a much, much smaller value, but everything becomes a positive number, and that really tiny value works well for probabilities. The other thing is that because these are exponential functions, the higher the score, the greater the probability, and this is something we're going to get to tune by changing the base of this exponential function. Was that enough of an explanation? Did I get anything wrong? Let me correct myself on one thing: I said earlier that we'd "adjust the value of e," and that's not quite what I meant; e is just this value, and what I mean is adjusting the value of the base in this equation. (By the way, when you leave a comment, if you really need me to understand something I got incorrect, you've got to provide a lot more context. And for this purpose, the precise meaning of e is not important.) Okay, I'm going to keep going; it's 2:30, holy smokes. Hopefully this is good; you all tell me later if you're learning something. Now, luckily for us, look at this (and let me just make sure this still works in p5 2.0, with some of this overloading of JavaScript functions): right there in p5, we've got the exp() function. "Calculates the value of Euler's number raised to the power of a number." Perfect. So, if I go back to my code, we already had the code for adding up all the scores and dividing by the sum. All we need to do is add up e to the power of each score, that's the new sum, and then apply the exp() function in the division as well. And now, even with the turtle having a score of -3, it gets a very low probability, 0.3 or 0.4%, with about 20% for dog and about 80% for cat. Okay, great: I've now explained softmax, and we have it in the code.
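Here's a minimal sketch of that step, using the simulated labels from earlier and p5's exp() (Math.exp works the same outside p5). The max-subtraction line is the numerical-stability tip from the chat; it doesn't change the resulting probabilities:

```javascript
// Softmax over the simulated logits.
let labels = [
  { name: 'cat', score: 2.3 },
  { name: 'dog', score: 0.9 },
  { name: 'turtle', score: -3 },
];

// Subtracting the max logit first (e^(z - max)) avoids overflow for
// large scores and leaves the final probabilities unchanged.
const maxScore = Math.max(...labels.map((l) => l.score));

let sum = 0;
for (const label of labels) sum += exp(label.score - maxScore);

for (const label of labels) {
  label.probability = exp(label.score - maxScore) / sum;
}
// cat ~0.80, dog ~0.20, turtle ~0.004
```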

### [2:30:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=9000s) Segment 31 (150:00 - 155:00)

Now, to truly illustrate how this is working, I think we need to add the sampling to this sketch, meaning: pick one of the labels based on those probability scores, as well as some kind of visualization to show the frequency of how often each one is picked over time. (I'm already trying to do too many things in this one video, kind of half-explaining Markov chains and language models, then the softmax function, then looking at image classification, but hopefully this is interesting to you and you're following me somehow.) The good news is I've already made an entire video about weighted selection, the sampling process. And I've covered it in the Nature of Code book, both in the chapter on randomness and in my chapter on evolutionary computing, where we're picking parents in an evolutionary system based on their fitness scores. So I can actually take this weighted selection code and bring it into my sketch. But I can't help myself; let me give you a very short overview of how this algorithm works before I adapt it into our code. Let's say we have the three labels and their probabilities; again, our assumption here is that these were all calculated via the softmax function. What if I visualized these probabilities in a rectangle? We can see how this maps: cat takes up 80%, dog takes up 15%, and turtle 5%. Now imagine I pick a random number between zero and one. That random number is going to land within the cat section 80% of the time, and so on and so forth. That's essentially what I'm doing in the weighted selection function. It picks a random number; the funny thing about it is, how do you know where it lands? The way that I do that is by starting with that random number and subtracting the probability scores from it, one at a time, and seeing how many of those I have to subtract before I've left the rectangle.

### [2:35:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=9300s) Segment 32 (155:00 - 160:00)

So, if I pick a random number between zero and one, let's say 0.75, I kind of land over here; if I subtract 0.8 from that, I'm already below zero, and that means: stick with cat. (I tried to explain this the other way around, starting the race from the bottom, and it made no sense; let me just look at the chat for a second. Yes: if the random number is less than or equal to cat's probability, the choice is cat; if it's higher, add in the next value and check again. That's the right way to explain it.) Let me do this one more time, more succinctly, because I've already made a very long video to edit here; I'm starting to sweat. (Oh no, and I just broke the tip of this fancy pen. I didn't really break it, it has replaceable tips, but I'll use a different one, so now I can't press a button to change colors. It's funny, now I'm drawing it in reverse, but it doesn't really matter.) Okay. So, you can see here I've drawn this rectangle, with each section scaled according to its probability value. Now imagine I pick a value between zero and one. If I pick, you know, 0.7, I'm going to get cat. 80% of the time I'll pick a value that's within the cat section, 15% of the time dog, and 5% of the time turtle. That's essentially how my weighted selection function works. I'll leave it at that, because right here you can see I'm picking a starting point, and then, as long as start is greater than zero, I subtract the fitness. Now, first of all, about that "fitness": I'm

### [2:40:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=9600s) Segment 33 (160:00 - 165:00)

going to change my code in a second. Actually, let me just rewrite this code to use the values of this particular sketch. These are called labels, I'm not actually using the score, I'm using the probability, and I'm going to say return labels[index].name. Okay, once again, what's happening here? I pick a random number between zero and one. As long as that value is greater than zero, I keep going: I start subtracting out the probability values, moving through the array. How many of those do I need to subtract before I get below zero? Well, you can see that if I picked 0.7 and subtract 0.8, I'm already below zero, so: pick cat. By the way, does the order matter? No, the order doesn't matter; this will work out in whatever order the probabilities appear. But notice: if I were to pick 0.9 and subtract 0.8, I'd still be above zero, at 0.1; then I go to subtract 0.15 and now I'm below zero, so: pick dog. So, in this order, I'm only going to get dog if I pick a value greater than 0.8, and it would have to be greater than 0.95 for me to get turtle. This algorithm always hurts my brain. So this is a function that, every time I run it, picks a random value from my labels array according to the probabilities. Let's add a draw function, run this sketch, and see what gets picked. (Sorry, the variable name in this function is index.) This isn't the best visualization, but you can see that cat is getting picked quite a bit more often than dog, and turtle comes up very rarely. Let's visualize this in a graph. Let's give every label a count, and let's actually not return the name, I don't care about that anymore, let's just return the index. Then let's draw a graph with bars that grow every time a label is picked. The width of each bar is sized according to the width of the canvas.
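Here's a minimal sketch of that weighted selection function as adapted to the labels array (variable names are my guesses at the ones in the sketch; the algorithm is the one from the weighted selection video and the Nature of Code book):

```javascript
// Weighted selection: returns the index of a label, picked according
// to its softmax probability.
function weightedSelection(labels) {
  let index = 0;
  let start = random(1); // p5's random(); Math.random() outside p5
  while (start > 0) {
    start -= labels[index].probability; // subtract until we drop below zero
    index++;
  }
  index--; // step back to the label that pushed us below zero
  return index;
}

// Usage, counting picks over time for the bar graph:
// labels[weightedSelection(labels)].count++;
```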

### [2:45:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=9900s) Segment 34 (165:00 - 170:00)

The height of each bar is the number of times it was picked, its count, and the y position, because I want it to grow from the bottom, is height minus h. We'll give these a stroke of 0 and a fill of 127, and draw a white background. And I don't know what I was doing here, but I need to increase the count of the index that I pick every time. Great. We can see that turtle has never been picked, or maybe just a couple of times, but we could give turtle a higher score just to be sure this is working. And now we can see the frequencies have adjusted accordingly. Okay, great. We're almost done, people. All right: temperature. Oh my goodness, where's that formula? Okay. So, some of you in the chat were saying temperature is like adjusting the base. But the way that I do temperature is to divide the exponent by the temperature. That's not really adjusting the base... or is it? Let me just work this out somewhere else for a second. Compare e to the (3 divided by temperature) versus e to the (2 divided by temperature). If the temperature were two, that's like saying e to the 1.5 versus e to the 1. So to me, that's adjusting the exponent; I don't think we're adjusting the base. Oh, you're not seeing me; I was over here, off camera. Let me think about this for a second. Oh yeah, I'm totally wrong. I have become invisible.

### [2:50:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=10200s) Segment 35 (170:00 - 175:00)

No, no, I got confused, because I was thinking about the negative from before; my brain is so melted. What am I even talking about? I'm just trying to scale down the exponent. Temperature is very simple. But is scaling down the exponent equivalent to scaling down the base? The chat puts it well: scaling down the exponent is the same as scaling down the base; the relationship between the operations just isn't linear. Yeah, that kind of makes sense: since e to the (z over T) equals (e to the 1 over T) all raised to the z, dividing the exponent by T really is the same as swapping in a different base. So I think I got it wrong earlier when I was talking about adjusting the value of e, but maybe I can cop to that a little bit. Okay. So this is where temperature comes in. What this equation allows us to do is basically dial the effect up or down. One of the things the softmax function does, which we talked about earlier, beyond just allowing for negative numbers, is that the raw scores, the logits if you will, are used as the exponent. (Oh my god, a bee just landed right on the camera. It would be so cool if it crawled in front of the lens.) So: one, we established that's great for handling negative scores; but also, because this is exponential, a higher score will lead to an even higher probability. But what if we want that effect to be increased even more, or turned down a bit? Well, all we would need to do is take this exponent (exponent, the exponent, very hard to say)

### [2:55:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=10500s) Segment 36 (175:00 - 180:00)

and divide it by temperature. So, a temperature of one will do absolutely nothing. A temperature of two has the result of making this exponential power smaller, so it's squashing those probabilities; they're going to get closer to each other. That's a hot temperature: we're turning the temperature up, and if the probabilities are getting closer to each other, we're more likely to pick the weird and strange outliers. If the temperature is turned down, meaning it's less than one, 0.5, 0.1 (obviously it can't be zero, you can't divide by zero), then we are increasing that exponential power, and cat is going to be even more likely to be picked. Let's look at how we add this mathematical property into our code, and what it does to the end result. (I'm just looking at the chat; I think I'm okay, but let me know if I did something wrong there. I'm reading the backlog.) So: let me introduce a temperature of one, and let me go put turtle back to -3. I didn't bother to package softmax up into its own function or think about architecture at all; I'm just explaining the algorithm here. But I can now divide by temperature, and let's use the shorter variable name temp. There you go: with a temperature of one, we have the same result, the same frequencies we started with. Temperature of two: look how much more often dog is being picked, that's the second bar from the left, and turtle's got a whole little bar there. Temperature of five. Temperature of 1,000: at this point, I've squashed the probabilities so much they're all basically even; it's a uniform probability distribution. Let's go back to a temperature below one: 0.5, and dog is even less frequent; 0.1, and we're not getting dog or turtle at all. Okay, that's softmax and temperature. Now, before I leave you, I want to go backwards and apply these concepts to both the Markov chain and a language model. Oh my goodness, this is going to involve some serious coding, so I'm going to try to do it kind of quickly.
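Putting the pieces together, here's a minimal sketch of softmax with the temperature knob, just dividing each score by temp before exponentiating, as described:

```javascript
// Softmax with temperature: temp = 1 leaves the distribution alone,
// temp > 1 flattens it, temp < 1 sharpens it (temp must be > 0).
function softmaxWithTemperature(scores, temp) {
  let sum = 0;
  for (const s of scores) sum += Math.exp(s / temp);
  return scores.map((s) => Math.exp(s / temp) / sum);
}

const logits = [2.3, 0.9, -3];
console.log(softmaxWithTemperature(logits, 1));   // ~[0.80, 0.20, 0.004]
console.log(softmaxWithTemperature(logits, 2));   // flatter: dog and turtle gain
console.log(softmaxWithTemperature(logits, 0.1)); // ~[1, 0, 0]: cat nearly always
```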

### [3:00:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=10800s) Segment 37 (180:00 - 185:00)

Why is this... oh, i plus order. Okay. Okay, I'm going to do this quickly. It's a little out of scope for this video, but I'm going to update the Markov chain code to count the frequency of characters, not just add them to an array of options, which is what's happening right here. So, we need to check the next character: if the n-gram has never seen that next character before, it should get a count of one; otherwise, we increase its count. This is me calculating scores, the frequencies of next characters, in my Markov model. Actually, I think that was all I had to do. Let's look at the Markov model again... no, I missed something. Oh: when I see a new n-gram, I made it an array; it needs to actually be an object. Now they're all NaN; what did I get wrong? Oh, if it hasn't seen the character before, it gets a one; otherwise, it counts up. There we go. "Space-c-o": whenever that group of three characters appears, a D follows 19 times, an M once, an L once, and an S once. So now that I've built this model, these counts need to be converted into probability scores with softmax. I need to get the options for that n-gram, which is now an object again, and then do a for loop over its keys, summing up the exponents.
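A minimal sketch of that change to the table-building step, counts instead of arrays (continuing the earlier n-gram sketch, so lines and order are the same as before):

```javascript
// Rebuild the n-gram table with frequency counts instead of arrays.
const ngrams = {}; // n-gram -> { nextChar: count }

for (const txt of lines) {
  for (let i = 0; i < txt.length - order; i++) {
    const gram = txt.substring(i, i + order);
    const next = txt.charAt(i + order);
    if (!ngrams[gram]) ngrams[gram] = {}; // an object now, not an array
    if (!ngrams[gram][next]) ngrams[gram][next] = 1;
    else ngrams[gram][next]++;
  }
}
// e.g. ngrams[' co'] -> { d: 19, m: 1, l: 1, s: 1 } on the full file
```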

### [3:05:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=11100s) Segment 38 (185:00 - 190:00)

So: for each character c of Object.keys(options), sum plus-equals the exponent of options[c], and then options[c] equals the exponent of options[c] divided by sum. Right? This should be the algorithm. It's looking at every n-gram (let me call the variable gram, that's kind of nicer), and these are the options that come after it; I sum up all of the counts raised as exponents, and then divide each one by that sum. This is softmax. (I tried a for...of over the entries with a variable called key, but "key" always messes stuff up; this is technically right, though it could be improved.) All right, let me just make sure this works. Oh no, it won't; I don't have the sampling yet. So let me just look at the table... yeah, that's right: 99% for D, and very small probabilities for the others. Okay, not the most elegant code, but I have now applied the softmax algorithm here. I am looking at every n-gram and all of the character counts that come after it, taking each character count and setting it as the exponent, e to the count, summing those up, and then dividing e to each character count by that sum. And now you can see I've set the values to those probabilities: D has a 99.999% chance; M, L, and S have much lower probabilities. Okay. Now I need to grab the weighted selection function and bring it into the Markov chain. I think it would make much more sense for this function to receive an array as its argument; I called it options, but maybe it's "probabilities," an array of probabilities, and it gives back an index, picked based on the raw values of those probabilities. The previous code was just picking a random possibility from an array of many characters; now this needs to be weighted selection.
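Here's a sketch of that conversion step over the whole table (Math.exp in place of p5's exp() so it runs anywhere):

```javascript
// Convert each n-gram's next-character counts into softmax probabilities.
for (const gram of Object.keys(ngrams)) {
  const options = ngrams[gram];
  let sum = 0;
  for (const c of Object.keys(options)) sum += Math.exp(options[c]);
  for (const c of Object.keys(options)) {
    options[c] = Math.exp(options[c]) / sum;
  }
}
// With counts { d: 19, m: 1, l: 1, s: 1 }, d ends up at ~99.999%.
```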

### [3:10:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=11400s) Segment 39 (190:00 - 195:00)

And the possibilities are now key-value pairs of character and probability, so I actually need to get the values out of that, and the character is the key. This is clunky object-oriented code that I don't expect will make a lot of sense to you, but it's not the core aspect of this video. What's a better way to write this? There's got to be one; I should use entries. Okay, I'm going to use Object.entries. So, possibilities is the collection of key-value pairs, character and probability, and there's a wonderful JavaScript function called entries which, I'm thinking, allows me to separate those key-value pairs into two separate arrays. (I'm looking for the right syntax here; I think it's fine.) This means I want to perform the weighted selection on the values, those are the probabilities, I could actually call the variable that; that gives me the next index, and the next character is the key associated with that next index. So this is just me adjusting the sampling process to use the new probabilities that are calculated by softmax. Let's see if this runs. Undefined. What does this look like here? Oh, I'm so confused. What did I get wrong? I just turned it into an array. What am I missing here?

### [3:15:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=11700s) Segment 40 (195:00 - 200:00)

Oh. Oh, I see: I don't get two separate arrays, I get an array of arrays. That's not what I want; that's not how entries works. I have to think about this. I mean, I could also just change the way this function works, but I'm going to do it my own awkward way, which is also what Sachin in the chat is suggesting: I'm just going to put those into separate variables first. Okay, so now, here is the sampling. Before, possibilities was just an array of all the characters, with many characters appearing multiple times, and I pulled one out at random. Now I need to sample based off the weighted selection function, which assumes an array of probabilities. So I'm going to say probs, for probabilities, is the values of the possibilities object. (These variable names are terrible.) The possibilities object is now key-value pairs, character to probability, so I can get an array of all the probabilities with Object.values, and an array of all the characters with Object.keys. We sample from the probabilities, and what we get back is the next index into the characters. So this is the new sampling, based off of that array of probabilities. And of course there are other ways to organize the code and think about it, but this is just the way I'm doing it now. Undefined. What did I miss here?
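For reference, here's a sketch of where this sampling step ends up once it works (weightedSelection here is the plain-array variant described above, and gram stands for whatever n-gram the generator is currently on):

```javascript
// Weighted selection over a plain array of probabilities; returns an index.
function weightedSelection(probs) {
  let index = 0;
  let start = Math.random();
  while (start > 0) {
    start -= probs[index];
    index++;
  }
  return index - 1;
}

// Sample the next character for the current n-gram.
const possibilities = ngrams[gram]; // e.g. { d: 0.99, m: ..., l: ..., s: ... }
const probs = Object.values(possibilities); // the probabilities
const chars = Object.keys(possibilities);   // the matching characters
const nextChar = chars[weightedSelection(probs)];
```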

### [3:20:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=12000s) Segment 41 (200:00 - 205:00)

I don't see what I missed. Oh, duh. The whole point is that I'm no longer picking a random one; it's weighted selection, and that gives me the index. There we go. And finally, we can now add temperature back into this. Let's start with a temperature of two, just to see something that feels maybe a little different from the names that were generated over there. And again, let me call it temp for short, and remember: all we're doing is taking the raw scores and dividing by temperature. "appearing uni," "coding me": the new name for this channel. Okay. Let me save this as version 4. Now let's take a look at this in the context of a large language model. Here I've loaded in SmolLM2, version two, 1.7 billion parameters, from Hugging Face; you can see my other video where I go through how to use transformers.js and load this model into a p5.js sketch. I've got the classic prompt that everyone loves to ask their favorite large language model: how many Rs in strawberry? "Strawberries have one R in their spelling," which is definitely correct. Now, what's going to happen if I press submit multiple times? I am getting the same exact response every single time. This is because the transformers.js library, with the text generation task, doesn't actually do any sampling by default when you ask it to generate. It just looks for the single next token with the highest probability. This is what's known as argmax: don't bother to sample, just find the one with the highest value and pick that one every time. It's kind of like having a temperature of zero: essentially, you've stretched the probabilities so far that the highest one is at 100% and everything else is at zero.
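For context, here's a minimal sketch of that setup with transformers.js; the exact model id is my assumption based on the name mentioned, so check Hugging Face for the right one:

```javascript
import { pipeline } from '@huggingface/transformers';

// Load a text-generation pipeline. The model id is an assumption; any
// chat-tuned text-generation model behaves similarly here.
const generator = await pipeline(
  'text-generation',
  'HuggingFaceTB/SmolLM2-1.7B-Instruct'
);

const messages = [{ role: 'user', content: 'How many Rs in strawberry?' }];

// With no sampling options, generation is greedy (argmax): the same
// prompt yields the same response every single time.
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```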

### [3:25:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=12300s) Segment 42 (205:00 - 210:00)

If I want the library to perform sampling, I need to add a property to my generate call. By the way, I called this variable pipe because I'm getting the text-generation pipeline, but I actually think it would be nicer to name the variable for what it does, which we could call generator. So now I'm awaiting generating the text, a maximum of 128 tokens, based on these messages that have gone in. (You can see I've commented out the system prompt; I was doing some experiments with that which you might also like to play with.) Now I'm going to add do_sample: true, so it will perform weighted sampling. Now you can see I'm getting wildly different responses each time. Let's lower the temperature; this is a little too hot for me. I can add another property, temperature, and give it a fairly low value, 0.1. So we can see I'm still getting some variation, but notice there's more of a repeating pattern; there's maybe less variety here. I'm still not getting the correct answer, but that's a whole other discussion we could have about this particular model and this kind of query. Let's just turn the temperature way up instead; let's make it pretty hot and set it to five. Okay, now I'm getting total nonsense, because the actual probabilities that this neural-network-based system calculated based off of its training data are almost gone; it's just giving me random stuff. It's actually kind of awesome, although I'm slightly afraid to read it too closely. Did we get something offensive here that I'll have to redo? I see an "uh oh" in the chat, but I think I'm okay, right? Let's just throw in a slider so we can adjust this in real time. In p5, you can make a slider with createSlider; I'm forgetting the argument order, I think it's the range first, 0.1 to 2, then an initial value of 1 and a step of 0.01. Actually, let's let the temperature go up to four. So there's the slider, and then we can just get the value of the slider every time we do the query. We can have a low temperature... wait a second, I might not have done this right. Let's look at the

### [3:30:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=12600s) Segment 43 (210:00 - 215:00)

min and max. Yes, I did it right. Oh, I just had the default value of one; let me turn the slider down. We can see I'm getting this sort of classic pattern with less variation. Turn the slider all the way up, and I'm getting something completely unrelated to my query, all about "here berries," which sounds delicious. Since this video is already a hundred years long, I should probably mention a couple of other things. So, we talked about softmax, the mathematical function that calculates all of the probabilities based off of raw scores, and temperature, a way to squash or stretch those probabilities. But you'll also see terms like top-k as well as top-p. These aren't really variations on temperature, but they're also properties that allow you to adjust the way the sampling happens. Top-k is actually, in a way, the same thing as argmax: argmax is top-k with a k of one. What top-k does, if I had a k of three and a thousand possibilities, is pick only from the top three. So it's a way of allowing sampling to still happen while reducing the number of possibilities to only the most likely ones. That's a property you could also adjust in the generate function. Then there's top-p, which is like top-k, but instead of keeping a fixed number of options, you keep only the options within a certain cumulative probability threshold. For example, with a top-p of 50%, you'd take the most frequent options, add them up, and as soon as you got above 50%, stop and let all the other ones fall by the wayside. In our earlier example, that would only use "world," because it's already above 50% on its own. Generally speaking, I'm just using temperature when I'm playing around with this stuff and trying to make creative applications; you might get confused if you start using top-p along with temperature, though you absolutely could. But these are some different properties you can use to adjust the way the sampling behaves. Okay, have I said anything totally incorrect? Because I'm wrapping this up. Poor Mattia; I don't know if this will ever be a video. What else should I say? All right: if you made it all the way to the end of this video, congratulations, gold star for you. I don't know where this is going to live on the Coding Train website, maybe in a playlist about transformers.js and machine learning, or somewhere else. But if it's there, I would love for you to submit anything you make based off of watching this video. For example, there are so many ways you could create a much more dynamic visualization of the sampling process, which I think would illustrate so much more about temperature and how it works than my crude little graph thing that's drawing here. Oh, let me actually pull this up.
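To gather the sampling properties in one place, here's a sketch of the generate call with all of them; the values are just the ones experimented with above, and in practice you'd typically pick temperature or top-k/top-p rather than stacking everything:

```javascript
// Sampling-related options on the transformers.js generate call.
// do_sample must be true for the others to have any effect.
const output = await generator(messages, {
  max_new_tokens: 128,
  do_sample: true,  // weighted sampling instead of greedy argmax
  temperature: 0.7, // >1 flattens the distribution, <1 sharpens it
  top_k: 3,         // only sample among the 3 most likely tokens
  top_p: 0.5,       // ...or among tokens within 50% cumulative probability
});
```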

### [3:35:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=12900s) Segment 44 (215:00 - 220:00)

Oh no, that's not what I was looking for; not this one. I have a version that I'll include that already has a temperature slider as well as a somewhat nicer graph visualization of picking from a range of scores. Interestingly enough, I didn't actually use softmax for that one, but that's fine. You could make a much more dynamic and interesting visualization of taking a raw set of scores, applying weighted selection, and adjusting with temperature; you could probably be much more creative than this, and I'd love to see that. I'll also include a nicer, more cleaned-up version of the Markov chain example with the weighted selection function and temperature. Maybe there's some kind of creative visualization you could do to show how the Markov chain is building and picking new text, or you could load your own data into it in some creative way; I'd love to see that too. And then of course, I'm sure there are so many creative and interesting improvements you could make, in interface and behavior, off of this basic "hello world" chatbot-style transformers.js language model with temperature example that I've made. So please let me know in the comments: did this video teach you something meaningful? I don't know anymore. I hope you enjoyed it, share with me what you make, and I'll see you next time on the Coding Train. Okay, that is the end of this monster of a stream today. So much for my Q&A and doing a p5 video, but I feel like this was important for me. I had to get some of these things out of

### [3:40:00](https://www.youtube.com/watch?v=ZfHVlMqJIxY&t=13200s) Segment 45 (220:00 - 223:00)

my system. The Bayesian text thing, this thing. I've got to run... oh, how have I been here for three hours and forty minutes? That's insane. I need a different process, but this is what's happening once a week. Zenova is still here; I don't know what time it is for you, it must be the middle of the night. But we did invent "here berries," so that's something. (Stamperia and Strombus being mentioned, with their long, different definitions.) Anyway, all right, I've got to go. Did I miss anything super important? Robin, thank you for the nice comment. Thank you, everybody, for being here with me during all of this time. If anybody wants to help me figure out how to get all this stuff onto the website, come join the Discord: thecodingtrain.com/discord. If you want to support the work that I do, there are two great ways. One would be to buy the Nature of Code book. Another is to sign up for Nebula; all these videos will also be on Nebula, ad-free, at go.nebula.tv/codingtrain. Nebula is amazing, with all sorts of original content and a wonderfully curated selection of YouTube videos; you won't have to navigate through the AI slop there. I also need to have a talk with the people who have signed up for YouTube memberships and the Patreon, which is sort of still lingering there. I do not think it's realistic to do a sticker mailing this year, based on how long, complicated, and expensive it was last year, and I can't get anything to anybody internationally with whatever tariff nonsense is going on. It still says in all those places that if you sign up, you'll get stickers in the mail, so I'm going to have to figure out some kind of digital sticker to send people, and hopefully continue to hand out stickers to supporters when I meet them in person. If anybody has any thoughts or feedback about that, let me know in the Discord. Okay, I've got to go. I want to hang out and chat and answer questions, but it's been almost four hours. See you next Monday; there will be a live stream, though I don't know what I'm going to do yet. (Hold on, just for my own edification: I can't figure out how to sign out here. You know what, forget it, I'll figure it out later.) Okay. Goodbye, everybody. See you next time. Going to press finish. No outro; stopping for all the outro music stuff takes too long. Okay. Goodbye.

---
*Source: https://ekstraktznaniy.ru/video/20667*