Stanford CS221 | Autumn 2025 | Lecture 14: Bayesian Networks and Learning




Contents (17 segments)

Segment 1 (00:00 - 05:00)

So we're going to continue talking about Bayesian networks today, in particular learning the parameters of a Bayesian network. Let's do a quick review of Bayesian networks. Remember, a Bayesian network is really a means to define a joint distribution over a set of variables, and you can think of it as a database that specifies facts you know about the world. In particular, you define a set of random variables; for example, in the alarm example you have three random variables B, E, and A for burglary, earthquake, and alarm. Then you define a directed acyclic graph over these variables by connecting variables that you think are related to each other, in particular one causing the other. And then for every node you define a local conditional distribution, p(x_i | parents(x_i)). Note that this is for every node, not every edge. In this example you have the probability of B given nothing, the probability of E given nothing, and the probability of A given B and E, which are A's parents. Each local conditional distribution looks like a little table (it can also be represented as a tensor), which specifies, for every possible value the variable can take on, what its probability is; when the node has parents, it's a distribution over that value for every possible assignment to the parents. So for example, for b = 1 and e = 0, I have a distribution over A which puts probability one on a = 1. Finally, you take these local conditional distributions, multiply them together, and you get the joint distribution over all your random variables. In our example, you multiply P(B) times P(E) times P(A | B, E), and that gives you the joint. And the joint distribution in this example looks like this. If you sum all these numbers, you should get one — well, not necessarily one here, right, because this is a conditional distribution.
For every B and E, if you sum over A, you get one. And here, this is a joint distribution: if you sum over all the assignments, you get one. Okay, so that's a Bayesian network; we've seen this twice already. Then we started talking about probabilistic inference. Once you have a database, what do you do with it? You query it, and that is the task of probabilistic inference. You might ask questions such as: what is the probability of a burglary given that the alarm went off? And we worked on this with three different algorithms. We did exact inference, where we directly work with this probability table: you select a = 1, so all the rows where a = 1; you marginalize out E, the variables that don't appear in your query, by collapsing rows that differ only in E; and then you divide by the probability of the evidence. I'm not going to go through this again, but that's how you do exact inference. Then we looked at two approximate, randomized algorithms. In rejection sampling, you write a probabilistic program that spits out samples of B, E, and A according to this joint distribution, you reject all the samples that don't satisfy A = 1, and you count the number of times B shows up as one versus zero; that's your estimate. And we also looked at Gibbs sampling, where the idea is that you start out with a full assignment to B, E, and A, and then you iteratively change the value of each variable conditioned on the rest. So you fix E and A and update B; you fix B and E — and you don't update A, because A is fixed to be one. So those are the probabilistic inference algorithms we covered. There are many more; if you take CS228, I think you can get more depth on that.
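The review above — multiply the local conditional distributions to get the joint, then select, marginalize, and normalize for exact inference — can be sketched in a few lines. This is a minimal sketch with made-up probabilities; here P(A | B, E) is deterministic (the alarm fires exactly when there is a burglary or an earthquake), as in the lecture's table.

```python
# Illustrative (made-up) local conditional distributions for the alarm network.
p_b = {0: 0.95, 1: 0.05}   # P(B): burglary
p_e = {0: 0.90, 1: 0.10}   # P(E): earthquake
# P(A | B, E): deterministic here — alarm fires iff burglary or earthquake.
p_a = {(b, e): {0: 0.0 if (b or e) else 1.0,
                1: 1.0 if (b or e) else 0.0}
       for b in (0, 1) for e in (0, 1)}

def joint(b, e, a):
    # The joint is the product of the local conditional distributions.
    return p_b[b] * p_e[e] * p_a[(b, e)][a]

# Exact inference for P(B = 1 | A = 1): select A = 1, marginalize out E,
# then divide by the probability of the evidence.
num = sum(joint(1, e, 1) for e in (0, 1))
den = sum(joint(b, e, 1) for b in (0, 1) for e in (0, 1))
posterior = num / den
```

Summing `joint` over all eight assignments gives one, which is the sanity check the lecture mentions for a joint distribution.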

Segment 2 (05:00 - 10:00)

we talked about conditional independence, which is more about understanding the properties of this joint distribution. The question is: are two variables A and B conditionally independent given another variable C? I'm going to present this in a slightly different way today than last time; hopefully this will be a little bit clearer, and I'll go through some more examples as well. The answer is yes if and only if every path from A to B is blocked, presumably by nodes in C. A path is blocked if it contains one of the following patterns. If you're trying to get from one end to the other and you have X going to C going to Y, for some variables X and Y — which could be A and B, but could also be other variables — then if you're conditioning on C, you can't get through; the path is blocked. Or if you have a C pointing out to both X and Y, and C is conditioned on, the path is also blocked. And then finally, whenever you have the V-structure, where the arrows go into some variable Z, it's the opposite: the path is blocked unless you condition on Z or a descendant of Z. I'm going to go through some examples, and I'll come back to these rules in a second. Last time we talked about two different examples; let me just write them on the board again to make this clear. So A and B, and let's see, C. In this case, are A and B independent? Yes: A is independent of B — in particular, I'm not conditioning on C — and this appeals to rule three here, because the V-structure is blocked if you're not conditioning on something. So you can't get across, and if every path is blocked, that means the variables are independent. And if you have A, C, and B like this, and let's say you condition on C —
I'm going to use shading to denote that I'm conditioning on a variable. So are A and B independent given C? Yes or no? No, right. This is explaining away — remember the burglary, earthquake, and alarm: if you see the alarm, that makes the burglary and the earthquake correlated, or anti-correlated. So that's one set of examples. The other set is if you have C pointing to both A and B. Here, are A and B independent? No. The way to see that is that there's a path here, right? If you look at case two, you're able to get through, which means they're not independent. Whereas if you condition on C, then A and B are independent given C, because the shaded node blocks the path. So if you just remember these contrasting pairs, I think you get basically 90% of what's happening. Let's play a game. I have A and B, and I'm going to add things to this network, and you tell me whether A and B are independent or not. Raise your hand if you think these are independent. Okay, good.
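The explaining-away pattern can be checked numerically by brute-force enumeration. This is a sketch with made-up probabilities (a noisy alarm rather than a deterministic one, so nothing degenerates): marginally, B and E are independent, but once you condition on A = 1, learning E = 1 lowers the probability of a burglary.

```python
from itertools import product

# Made-up local conditional distributions for a noisy alarm network.
p_b = {0: 0.9, 1: 0.1}
p_e = {0: 0.8, 1: 0.2}
p_a1 = {(0, 0): 0.01, (0, 1): 0.4, (1, 0): 0.8, (1, 1): 0.95}  # P(A=1 | B, E)

def joint(b, e, a):
    pa = p_a1[(b, e)] if a == 1 else 1.0 - p_a1[(b, e)]
    return p_b[b] * p_e[e] * pa

def prob_b1(**evidence):
    """P(B = 1 | evidence) by brute-force enumeration over all assignments."""
    def consistent(b, e, a):
        vals = {"b": b, "e": e, "a": a}
        return all(vals[k] == v for k, v in evidence.items())
    num = sum(joint(b, e, a) for b, e, a in product((0, 1), repeat=3)
              if consistent(b, e, a) and b == 1)
    den = sum(joint(b, e, a) for b, e, a in product((0, 1), repeat=3)
              if consistent(b, e, a))
    return num / den

# Marginally independent: conditioning on E alone changes nothing about B.
independent = abs(prob_b1() - prob_b1(e=1)) < 1e-9
# Given the alarm, E = 1 "explains away" the burglary: P(B=1) drops.
explained_away = prob_b1(a=1, e=1) < prob_b1(a=1)
```

Both flags come out true, matching the V-structure rule: the path through A is open only once A is observed.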

Segment 3 (10:00 - 15:00)

Are they independent? This is not independent anymore. And remember, the third rule says that if you condition on the common child, they're definitely not independent — and it also holds for any descendant. How about this? Not independent. So let's undo that. What about this? Are they independent? Okay, this is too easy; I'm going to try to make it harder, so I need to get rid of that. Are they independent now? They should be independent, right? Because if you ever see these unobserved leaves, you can pretend they don't exist, and then there is no path over to B, because D is blocking both this path and this path. And this is clearly not independent, and if you block it, it should be independent. Hopefully people get the intuition. The whole point of this is that, given a Bayesian network, generally you have a query, you condition on some variables, and this lets you see, based on that evidence, what is independent or not. This also has implications when you're doing inference, because independence is a great thing: it means things can be parallelized. They don't interact, so you can handle each one independently. Often, conditioning on a variable makes a bunch of things independent, and therefore you can do things separately. Just to give you another fun example: often you have a Bayesian network with some variable Z here, and X1, X2, and so on are your data points, and let's say you're doing Gibbs sampling. If you condition on Z, all of the Xs are independent, because Z blocks the path from any X to any other X, which means you can sample all of them in parallel conditioned on Z.
So that's all I'm going to say about conditional independence and probabilistic inference. Now we're going to move to the main course here, which is: where do all these probabilities come from? Everything we've done assumed that the probabilities are just given to you, or we write them down, or we pull them out of thin air. But this is not a very principled way of doing things, so the natural thing is to try to learn these parameters from data. First we're going to consider the fully observed setting, and by that I mean I'm given a set of examples where each example is an assignment to all the variables — for every example, I see what all the variables are. Later I'll show you what happens when you only see a subset of the variables, which becomes much more complicated. The output of learning is always the parameters of the model. In this case, what are the parameters? It's the set of local conditional distributions — all the numbers in these probability tables. So what I'm going to do here is walk through some examples: I'll show you an algorithm to estimate the probabilities, and then I'll explain what the general principle is. All right, let's start with the world's simplest Bayesian network and tell a little bit of a story. Suppose you're trying to model movie ratings, so you have a single variable R which represents the rating of

Segment 4 (15:00 - 20:00)

a movie. The rating can be 1, 2, 3, 4, or 5, and the Bayesian network is just depicted as one node, no edges. So: the world's simplest Bayesian network, and the joint distribution is simply equal to this one local conditional distribution, p_R(r). I've used the subscript R because later we're going to want to name each of the local conditional distributions. Just to spell it out, the parameters of this Bayesian network are the probability of one, the probability of two, of three, of four, and of five. So there are five numbers that represent this distribution, five numbers I have to estimate. If you're clever, you'll think: oh, you only need four, because once you know four of them the last one is determined — they have to sum to one. But let's not worry about that too much. So what does the training data look like? Here is an example of training data for this variable. Remember, a full assignment to just one variable: I saw a movie and rated it one; saw another movie, rated it three; saw another movie, rated it four; and so on and so forth. So now let me ask you this: how would you estimate the parameters? Just intuitively, what would you do? Count all the observations and normalize. Okay, great — that's exactly what we're going to do. If you're just following your nose, this is the most sensible thing to do. We're going to keep count of the number of times we saw each rating. For every example, I increment the count of that rating. I saw a three, so three gets one count. A four gets one count; another four, so four gets bumped up to two; another four, a three, and so on and so forth. And now I have these counts, and I normalize them to get a distribution: specifically, I sum the counts all together and divide.
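The one-variable count-and-normalize step can be sketched in a few lines. The training ratings here are made up, not the exact numbers from the slide:

```python
from collections import Counter

# Count-and-normalize for the single-variable rating network.
ratings = [1, 3, 4, 4, 4, 3, 5, 2]            # illustrative training data
counts = Counter(ratings)                      # count each observed rating
total = sum(counts.values())                   # sum the counts together...
p_r = {r: counts[r] / total for r in counts}   # ...and divide to normalize
```

By construction the estimated probabilities sum to one, and each probability is just a count divided by the total.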
So that's it. And this is basically the primitive we're going to see; everything else is variations on this theme. It's very simple: just count and normalize. You probably didn't even need to come to lecture to guess how to do this. But now let's do something slightly more complicated: two variables. Let's say you're modeling not just the rating of the movie but also its genre, and the genre can take two values, either drama or comedy. This is the two-variable Bayesian network for this situation. The joint distribution is the probability of the genre times the probability of the rating given the genre, and here I'm using the subscripts, as you can see, because I have two distributions — a probability over genre and a probability over rating — and I'm using G and R to label them. So what does the training data look like? Every example has a genre and a rating. For example, this one says there's a drama movie that was rated five; there's a comedy that was rated one. So how do I estimate this Bayesian network? It's also going to be count and normalize, but we have to do a little bit more bookkeeping now. In particular, I'm going to count the number of times various genres show up, and I'm also going to count the number of times a genre shows up with a particular rating. So when I go through my training data now, I pick up a drama that was rated four. I increment the counts: I've got one drama, and I also keep track that for drama, I saw a movie rated four once. Another example: I get a drama rated four, so now (drama, four) occurs twice. I get a
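The two-variable bookkeeping can be sketched directly, using data that matches the lecture's running example (three dramas rated 4, 4, 5 and two comedies rated 1, 5, which yields the 3/5, 2/5 and 2/3, 1/3 numbers quoted below):

```python
from collections import Counter

# (genre, rating) training examples matching the lecture's counts.
data = [("drama", 4), ("drama", 4), ("drama", 5), ("comedy", 1), ("comedy", 5)]

count_g = Counter(g for g, r in data)   # counts for estimating P(G)
count_gr = Counter(data)                # counts for estimating P(R | G)

p_g = {g: c / len(data) for g, c in count_g.items()}
# There is one distribution to normalize per value of the parent G:
p_r_given_g = {g: {r: count_gr[(g, r)] / count_g[g]
                   for (g2, r) in count_gr if g2 == g}
               for g in count_g}
```

Note that estimating P(G) ignores the ratings entirely, and each conditional distribution is normalized separately for each genre.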

Segment 5 (20:00 - 25:00)

drama five, so now there are three dramas, and (drama, five) shows up once. Now I get a comedy, and I keep track that comedy was rated one once; then there are two comedies, and that second comedy was rated five. So this is just counting. And now I normalize. The counts for G are exactly the same as before; we just normalize these counts into this distribution: three over five and two over five. And then what do I do with these? There are actually two distributions here to normalize. One is when G is drama: for G = drama, I normalize this distribution, 2/3 and 1/3. And then when G is comedy, I take this distribution and normalize. And that's it. So to estimate a local conditional distribution, just ignore the other variables: to estimate P(G), I don't care what the value of R is. In particular, for every local conditional distribution, I'm only looking at the small part of the assignment that's relevant for my estimation. Other than that, it's just count and normalize. Now we're going to go to three variables. Suppose we have a genre, a variable A which indicates whether the movie won an award or not, and the rating of the movie. This is what the Bayesian network looks like. You might think, well, these V-structures — hopefully I've taught you to be really careful with them, because they're a bit tricky — but here, for learning, it's actually going to be pretty mundane. Here's the joint distribution: P(G) times P(A) times P(R | G, A). So there are three parameters, or local conditional distributions, one for every variable. Here's the training data; each example is a full assignment to all the variables. This one was a comedy that didn't win an award but somehow still got rated very highly.
If we just try to pattern-match and think about what we have to do, we have to count all the local assignments. Hopefully you see the pattern: for every local conditional distribution, I need to count the number of times that local assignment occurs. If I'm estimating P(G), I need counts of G. If I'm estimating P(A), I need counts of A. If I'm estimating the probability of the rating given G and A, I need to count how many times G, A, and R show up together. Then I just go through my training data. I pick up, say, a drama, no award, rated three; so I update the counts for G, update the counts for A, and then the counts for R given G and A. The way I've structured this is very suggestive of the probability I'm going to estimate: in particular, I bundle the values of the parents together. So for every value of the genre and of whether it won an award, I have a value for the node itself — the rating — and the count. There are a few levels here. Most of it is count and normalize, but with some bookkeeping: the first level is which local conditional distribution you're looking at, and then for each one
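The bundling of parent values can be made concrete by keying the conditional counts on the parent tuple (G, A). This is a sketch with made-up data:

```python
from collections import Counter

# (genre, award, rating) examples — illustrative, not the slide's data.
data = [("drama", 0, 3), ("drama", 1, 5), ("comedy", 0, 5), ("comedy", 0, 2)]

count_g = Counter(g for g, a, r in data)              # counts for P(G)
count_a = Counter(a for g, a, r in data)              # counts for P(A)
count_ga = Counter((g, a) for g, a, r in data)        # parent combinations
count_gar = Counter(((g, a), r) for g, a, r in data)  # (parents, node) counts

# One distribution per parent-value combination: the parents act as a unit.
p_r_given_ga = {ga: {r: count_gar[(ga, r)] / count_ga[ga]
                     for (ga2, r) in count_gar if ga2 == ga}
                for ga in count_ga}
```

The number of conditional distributions normalized equals the number of parent-value combinations actually seen, which matches the point made below about (drama, 0), (drama, 1), (comedy, 0), (comedy, 1).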

Segment 6 (25:00 - 30:00)

of those, there are the values assigned to the parents and the value assigned to the node itself, and that maps to the count. So we can go through the rest of the training data set and get some counts: counts for G, counts for A, and counts for this triple. And then finally you normalize. You take the counts of G and normalize them as we've done before; the counts of A, you normalize; and then for each of these you normalize and get a probability. One thing to note is that the number of distributions you're normalizing is equal to the number of parent-value combinations you have. Here I have one distribution for (drama, 0), another for (drama, 1), (comedy, 0), and (comedy, 1). And importantly, remember to condition on all the parents simultaneously. The parents work as a unit even though they look like individual nodes — I mean, they are individual nodes, and the edges make them look separate, but you should really think of them as a unit. So let's move on to the other type of example: the inverted V-structure. Assume we have a genre variable G, and then we have two users, with R1 and R2 representing their ratings. This is what the Bayesian network looks like: the probability of G, times the probability of R1 given G, times the probability of R2 given G. The parameters here are one local conditional distribution for each node. The training data should hopefully be fairly straightforward by now. For example, this last one says I have a comedy, and user one rated it five, user two rated it four. So now it's count and normalize again, and remember, for every local conditional distribution, I need to count the number of these local configurations I've seen. So I have counts for G, (G, R1), and (G, R2). And I go through my training data.
I pick up a drama where user one rated it four and user two rated it five. So I've got one drama. For user one, I note that drama with a rating of four occurred once, and for user two, drama with a rating of five occurred once. Then I go through all of these — I'll skip forward — and I get counts for user one as well as user two. Notice that the counts for (G, R1) only depend on G and R1; they don't depend on the value of R2, and vice versa for the counts of (G, R2). So now, once you have these counts, you normalize. I'm going to normalize this distribution, this one, this one, and this one. There are five distributions, and remember, every distribution is identified by the local conditional distribution together with the values of the parents. Here I normalize the genre as before, and then — let me just compute the answer and show you — I get essentially the same structure as the counts, except that now all the counts are normalized. So hopefully you've seen enough examples that I could give you any Bayesian network and you'd be able to estimate it. Now let's do

Segment 7 (30:00 - 35:00)

something a little bit different. To motivate things, imagine you have a thousand users. If you're building this Bayesian network, you might initially think: well, I have a genre, I have R1 through R1000, and the number of parameters is given by a local conditional distribution for every user. But one problem with that is that if a user only watches one movie, that's not going to give you a very good estimate; you won't be able to estimate that user very well. So the idea behind parameter sharing is to estimate one distribution over rating given genre instead of 1,000. I'm going to have the same Bayesian network as before: a genre, a rating for user one, a rating for user two. But what I'm going to do now — and this is where the indices become important — is have the probability of G, the probability of R1 indexed by R, and the probability of R2 also indexed by R. So the parameters here are just p_G and p_R. Remember, before they were p_G, p_R1, and p_R2: I had two local conditional distributions for the ratings, one each for R1 and R2, and now I only have one, even though the Bayesian network has the same structure. So the picture you should have here is this idea of parameter sharing. You have this Bayesian network, and so far we haven't really had to think about where the probabilities live; we've treated them as implicit, like they're in the edges or something. But when we're doing learning, it matters where these parameters come from. The real way to think about this is that you have these local conditional tables floating around, and you hook these tables up to nodes to power them — to give them probabilistic life. Here, what we're doing is having one table that powers both of these nodes, whereas before we had two tables powering each of these nodes respectively. That's the mental image you should have.
Okay, so how does that change the estimation? We're going to use the same data as before and count the local assignments. Remember, the counts we keep track of are associated with the parameters, not the variables in the network. And the parameters here — we only have two conditional distributions, P(G) and P(R | G) — so we're just going to have two counts. When I go through my training data, I increment the counts over genre: drama was seen once. And now for R1 and R2, I increment the same counts as before; but where before these ended up in completely different local conditional distributions, now both (drama, four) and (drama, five) end up in the same one. So it's almost like I have many examples for this given assignment. Ah, good question: how is this different from just combining these two nodes into the same variable? The idea is that these are still two users, right? When you collect data, I'm going to collect the genre and the ratings of user one and user two — the data is exactly the same as before. If you collapsed it into one node, you wouldn't have a place to write down both the rating of user one and the rating of user two. So up until now, we've thought of the variables in a network and the distributions as one-to-one, but now, for the first time, we're splitting them. There's a question of when this makes sense to do, and the suggestion is that when the two users have similar probability distributions of R given G, then it makes sense. If the two users are very different,
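The shared-table counting can be sketched as follows: both users' ratings feed the same P(R | G) counts, so each training example contributes two (genre, rating) observations. The data here is made up:

```python
from collections import Counter

# (genre, user1_rating, user2_rating) examples — illustrative data.
data = [("drama", 4, 5), ("drama", 4, 4), ("comedy", 1, 5)]

count_g = Counter(g for g, r1, r2 in data)
count_gr = Counter()
for g, r1, r2 in data:
    count_gr[(g, r1)] += 1
    count_gr[(g, r2)] += 1   # same shared table as r1 — this is the sharing

# Each example contributes two ratings, so the per-genre totals double.
totals = {g: 2 * c for g, c in count_g.items()}
p_r_given_g = {g: {r: count_gr[(g, r)] / totals[g]
                   for (g2, r) in count_gr if g2 == g}
               for g in count_g}
```

Compare with the unshared version, where (drama, 4) from user one and (drama, 5) from user two would land in two separate tables and each table would see half as much data.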

Segment 8 (35:00 - 40:00)

then this is probably not a very helpful thing to do. Okay, so we estimated these counts, and then we normalize, and normalization is just taking these counts and dividing by the total. So now, instead of having five distributions, we have three: one for G, and one for R given each of G = drama and G = comedy. Another intuition to help you out here: in probabilistic inference, which is what we did in the previous lecture, we're only reading from the local conditional distributions. We're not changing them; we're just reading them and using them to make inferences. In that case, we don't care whether P(R1 | G) and P(R2 | G) are the same distribution or not; they're just numbers. But when you're learning the parameters, you can think of what we're doing as writing to the local conditional distributions. So it does matter whether they're the same or not, because writing to something is like in programming: when you pass an object by reference or by value, it matters, because if you pass by reference, then mutations here cause mutations there. It's the same idea. And then, to the earlier point: when do you do parameter sharing? There are a few considerations. One is that when you share parameters, you have fewer parameters to work with, which means you need fewer examples to learn. If you have a thousand users, each with its own bespoke distribution, you need to see a lot of data from each user to learn it. On the other hand, having more parameters provides greater flexibility: if the users are all very different, then you do want more parameters to be able to model the nuances of each user's preferences. So which one you choose is a modeling decision. On one hand, you're trading off how similar the two users are, which is the point brought up earlier,
against how much data you have on each user. If you have a lot of data, you can afford to split and have many more parameters, but if you don't have very much data, you need to be very stingy with your parameters. All right, let's talk about another model where we're going to do parameter sharing. This is the final example before I give you the general case. We talked briefly about the hidden Markov model in the first lecture, with a probabilistic program. Just to remind people what it is: in a hidden Markov model, you have two sequences of variables. One is a set of hidden variables, H1 through HT, say, which might be the position of an object at time t; and then there's the observation (or evidence, or emission) Et, which you can think of as a sensor reading at time t. The reason it's called a hidden Markov model is that it's a kind of Markov chain on the hidden states. Hidden Markov models are commonly used in a bunch of places: they used to be used in speech recognition and computational biology, and they can be used for object tracking as well. The idea is that the joint distribution over all these nodes is — remember, it's always the probability of each node conditioned on its parents — P(H1), times P(E1 | H1), times P(H2 | H1), P(E2 | H2), P(H3 | H2), P(E3 | H3). And there are three types of parameters: the starting distribution p_start, which is only used to give you the distribution over H1; the transition probability p_trans, which gives the distribution over one hidden state conditioned on the previous hidden state; and p_emit, which gives a distribution over the sensor reading at a particular point in time, given the corresponding hidden state.
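The three shared tables and the factored joint can be sketched as follows. The numbers here are made up (a two-state, two-observation model); the key point is that the same p_trans and p_emit power every time step, whatever the sequence length:

```python
# Illustrative HMM parameters: two hidden states, two observation values.
p_start = {0: 0.6, 1: 0.4}                              # P(h_1)
p_trans = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}    # P(h_t | h_{t-1})
p_emit  = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # P(e_t | h_t)

def hmm_joint(h, e):
    """P(h_1..h_T, e_1..e_T): product of each node given its parents.

    The same p_trans and p_emit dictionaries are reused at every step —
    that is the parameter sharing.
    """
    p = p_start[h[0]] * p_emit[h[0]][e[0]]
    for t in range(1, len(h)):
        p *= p_trans[h[t - 1]][h[t]] * p_emit[h[t]][e[t]]
    return p
```

Summing `hmm_joint` over all hidden and observed sequences of a fixed length gives one, as a joint distribution should.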

Segment 9 (40:00 - 45:00)

Okay, so there's parameter sharing here, because at every point in time I'm using the same distribution over a sensor reading given the object's position, and for every time step except the first, I'm using the same transition between the previous state and the current state. The way I would say it is that each H is given by a distribution conditioned on its parent, but this distribution and this distribution are the same, because of parameter sharing: I only have one p_trans, and it's the same distribution used at every step. And similarly for the emissions: this is p_emit, this is p_emit, and this is p_emit. All right, so what does the training data look like? I'm going to assume the fully observed case, where I actually do see the hidden states. So here I have H1, E1, H2, E2, H3, E3 and some set of values; then I get another sequence with another set of values. To count, I keep track of counts for every local conditional distribution — there are three of them, so I have three counts. And notice here that you start to see quite a big difference between the parameters and the variables: this is only three steps, but it could be 100 steps. It doesn't matter how many steps there are; I still only have these three local conditional distributions. Then how do I count? I go through my training data. In the first sequence, H1 is associated with start; H1 is a zero, so that count gets incremented by one. Emit: H1 and E1 are both zero, so I add one there. Transition: H1 to H2 is a zero to a one, so for transition (0, 1) I increment that count, and so on and so forth; then I do the same for the second example. And then I get a bunch of counts, and then I normalize.
This might not be so interesting because these counts are so sparse, but I normalize p_start, and then for each of these local conditional distributions I normalize as well. This is the same as before: once you have the counts, normalization doesn't look at the structure of the Bayesian network anymore; it just looks at these local conditional distributions and normalizes each one. Okay, so let me move on to the general case. By now we've gone all the way from one to two to three variables, three variables with parameter sharing, and HMMs. We've seen a lot of different examples, and now I'm going to show you the general story for how to estimate a Bayesian network. It's exactly the same, but hopefully the code I give you makes it look a little more general. The way to think about this is that there is a set D, which is a set of types of local conditional distributions. For example, for the HMM, D would be {start, trans, emit}. You can think of D almost as a set of labels. So the elements of D are, in particular, not variables — they could correspond to variables, but in general they don't, because the parameters and the variables are two separate things. The parameters are simply a distribution for each element of D; for example, in the HMM you have p_start, p_trans, and p_emit. So now the joint distribution is given as follows. For

Segment 10 (45:00 - 50:00)

every variable — I have variables x1 through xn — for every xi, di is the element of D that tells me which local conditional distribution to use. Remember that picture where each node is powered by a local conditional distribution: di tells me which one to go fetch. So I'm indexing that distribution, and then I have xi given the parents of xi; this part is the same as before. The way parameter sharing manifests here is that two different variables, say x3 and x7, might have the same di, so d3 = d7. All right, let's go through estimation with the HMM example again. Same training data: each data point is a full assignment to all the variables. I'm going to define this network structure here; this is my way of encoding the structure of the Bayesian network as well as which parameters power each node. For every variable, I give the parameter name — this is the element of D — and also the list of parent variable names. H1 has no parents and is driven by p_start. H2 has one parent, H1, and is driven by p_trans; H3 is driven by p_trans from H2; and each Ei is driven by p_emit and depends on its parent Hi. So this is basically a representation of the structure, where in addition I tell you which parameters I'm using. Now, with this network structure, there's a very simple function that takes the network structure and the training data and gives me parameters. Let's walk through what that looks like. Ignore the pseudocounts; I'll come back to those later. I initialize the counts to nothing, and then I loop over the training data. For every variable that's assigned a particular value — so here H1 is assigned zero — I look up the network structure to find out who the parents are and which parameter I'm using.
And once I do that, I can index the assignment by the parent variables to get the value of the parents. Here there are no parents, so the parents' value is the empty tuple, and then I increment the count. There are multiple layers here: the parameter name, then the value of the parents, then the value of the node, mapping to a count. Now let's go through E1. E1 has value zero, and its parameter name is p_emit (the network structure tells me that), and its parent is H1. So I look up H1, and its value is zero here. Now I update the count for p_emit, because that's the parameter name, at the parents' value, which is zero (obtained by indexing the assignment with the parent variables), and finally at the node value, because E1 is zero, and I increment

Segment 11 (50:00 - 55:00)

that by one. Before, I had counts_g and counts_r, but now I'm replacing those with one nested dictionary, counts, using the parameter names as the first key, which gives me essentially this conditional distribution over node value given parents' value. I'm going to skip ahead: at the end of the day I have a set of counts. This should be exactly the same as what we computed for the HMM in a slightly more manual way. Then, to normalize, theta is going to denote all the probabilities. I go through each parameter name (p_start, p_emit, p_trans), then through every parents' value (remember, in general this is a tuple over all the parents), and I normalize. So for every parameter name and parents' value, I have a distribution that I'm normalizing, and that gives me this distribution. So there you have it: about 15 lines of code, and it estimates the parameters of any arbitrary Bayesian network. It's count and normalize, but you have to do some bookkeeping and make sure you're counting the right thing. This assumes the network structure is known and fixed. There are methods for learning the structure as well, but we're not going to talk about that in this class. Good question: why is this called supervised learning? The idea is that you're fully observing all the variables here. I should probably call it, more accurately, the fully observable setting. The way to think about it is that this is in contrast with unsupervised learning, which we're going to see an example of, where some of the variables are unknown. So you could think of this as fully supervised, and later we'll see partially supervised. Okay. So let me talk about the principle here.
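Putting the two passes together, here is a minimal sketch of the whole estimator (the `network` encoding and the data format are my assumptions about what the lecture's code looks like):

```python
from collections import defaultdict

# Hypothetical network structure: variable -> (parameter name, parent variables).
network = {
    "H1": ("p_start", []), "H2": ("p_trans", ["H1"]), "H3": ("p_trans", ["H2"]),
    "E1": ("p_emit", ["H1"]), "E2": ("p_emit", ["H2"]), "E3": ("p_emit", ["H3"]),
}

# Each training example is a full assignment to all variables (made-up data).
data = [
    {"H1": 0, "H2": 1, "H3": 1, "E1": 0, "E2": 1, "E3": 1},
    {"H1": 1, "H2": 1, "H3": 1, "E1": 1, "E2": 1, "E3": 0},
]

# Count: counts[param][parents_value][node_value] -> count.
counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
for x in data:
    for var, value in x.items():
        param, parents = network[var]
        parents_value = tuple(x[p] for p in parents)  # () if no parents
        counts[param][parents_value][value] += 1

# Normalize: theta[param][parents_value] is a distribution over node values.
theta = {}
for param, by_parents in counts.items():
    theta[param] = {}
    for parents_value, dist in by_parents.items():
        total = sum(dist.values())
        theta[param][parents_value] = {v: c / total for v, c in dist.items()}
```

Note how parameter sharing falls out for free: because H2 and H3 both route their counts into `counts["p_trans"]`, their observations are pooled before normalizing.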
So far we just said count and normalize, and it seems believable that this is a good thing to do. But it's more than that: it corresponds exactly to maximum likelihood estimation. So what is maximum likelihood? It's a very simple and powerful principle in statistics, and really in all of machine learning. If you think about training a large language model, that's what you use: minimizing cross entropy, or minimizing perplexity, is exactly the same as maximum likelihood. The principle says: find the parameters that maximize the likelihood of the data. It's transparently in the name. In symbols, if you have a dataset and a model that defines a distribution over assignments, parameterized by theta, you want to choose the theta that drives the likelihood (the product of the probabilities of the data points) as high as possible. In other words, if I give you a set of parameters, say I specify that the probability of G = comedy is 0.7 and fill out the conditional distributions, then I can evaluate the likelihood of my data by multiplying the probabilities of each data point. For every setting of the probabilities I get a number, so I can think of this as an optimization problem: find the conditional probabilities that maximize the likelihood of my data. Another way to write this: taking the log (which is monotonic), this is equivalent to maximizing the sum of the log probabilities, which is the negative cross entropy. It turns out that the solution to this

Segment 12 (55:00 - 60:00)

is, in the fully observable case, just count and normalize. And this is really interesting, because normally you think of optimization as: oh dear, I need to compute the gradient, take steps, worry about the step size and convergence. No. In this particular instance you don't need iterative optimization at all. There's a closed-form solution, which is count and normalize. Let me briefly show you why, with some quick examples. Maximum likelihood with one variable: we go back to the really simple case where there's one variable R, the rating of the movie, and say my training data is the ratings 1, 5, 5. If you write down the maximum likelihood objective, it's the probability of 1 times the probability of 5 times the probability of 5. Theta here is the probabilities of each of these ratings, and the one constraint is that they have to form a probability distribution: they all have to sum to one. I'm not going to go through the math formally, but the way you solve this optimization problem is to introduce a Lagrange multiplier for the sum-to-one constraint, compute the gradient, set it to zero, and solve. You get that the probability of a rating is the number of times that rating occurred over the total number. Intuitively this makes sense: if you see a 5, you want to drive up the probability of 5; you didn't see a 4, so the probability of 4 can just be zero, because you don't want to waste probability mass; and since you saw 1 once but 5 twice, you want to place more probability on 5. That was slightly sketchy intuition, but if you do the math it works out. Now in the two-variable case, say you have a genre and a rating, then what happens?
If you look at the probability of your data, it's the probability of drama times the probability of 4 given drama for the first example, the probability of drama times the probability of 5 given drama for the second, and the probability of comedy times the probability of 5 given comedy for the third. So there are six factors corresponding to these three examples. The observation here is that you can rearrange these six factors into groups that actually have something to do with each other. I put all the probabilities of the genre together: drama, drama, comedy. Then I have the probabilities of a rating given drama, so those two go together. And finally the probability of 5 given comedy. Now notice that each of these is an independent optimization problem, because the probability of comedy can be varied independently of the probability of 5 given drama. This gives you three separate optimization problems, one per group, and each of them is solved by the one-variable case, where you just count and normalize. So maximum likelihood means finding the parameters that maximize the likelihood of the data, and count-and-normalize is a closed-form solution. This set of ideas is really nice, and you can now sleep at night in comfort that your intuitive count-and-normalize procedure is actually principled. Okay, let's move on. Let's talk about Laplace smoothing. Go back to the one-variable case: we have just one variable R, and the training data is just two ratings, a 1 and a 4. If I do supervised learning on this in the standard way, maximum likelihood, now that we have fancier

Segment 13 (60:00 - 65:00)

terminology to talk about it, then you would say that the probability of 1 is 0.5 and the probability of 4 is 0.5. You look at this and say: okay, that means the probability of 2 is zero, right? And do we really believe that? If you got two points, one rated 4 and one rated 1, putting zero probability on 2 seems a little close-minded. It's probably just that you didn't have enough points, rather than it actually being true that a rating of 2 is impossible for some reason. So this is, in some sense, not a very good estimate when you have few samples. With a lot of samples it's probably fine, but with few samples maximum likelihood is, in some sense, overfitting to this data. The solution here is Laplace smoothing, and it's a very simple idea: you just add some quantity, a smoothing factor lambda, to all the counts. Here's what we're going to do. We're going to call these pseudocounts: let's pretend we saw this movie rated 1 once, 2 once, 3 once, 4 once, and 5 once. We didn't actually see this data, but let's just pretend we did. Now when I do supervised learning, I take these pseudocounts, add them to my counts, and get theta. Let me quickly do this on the board to make it a little clearer. The data is R = 1 and R = 4. I write 1, 2, 3, 4, 5 with their counts. Before, I just had a count of one for rating 1 and one for rating 4; I would normalize those two and put zeros everywhere else. But now I pretend I saw a 1 everywhere, so the counts for ratings 1 and 4 are really 2, and then I normalize: those are 2 out of 7 each, and the others are 1 out of 7.
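The board computation can be reproduced in a few lines (a sketch; lambda = 1 matches the pseudocounts on the board):

```python
from collections import Counter

data = [1, 4]              # the two observed ratings
ratings = [1, 2, 3, 4, 5]  # the support of R
lam = 1                    # Laplace smoothing: add lam to every count

counts = Counter(data)
smoothed = {r: counts[r] + lam for r in ratings}  # {1: 2, 2: 1, 3: 1, 4: 2, 5: 1}
total = sum(smoothed.values())                    # 7
theta = {r: c / total for r, c in smoothed.items()}
# P(R=1) = P(R=4) = 2/7; the unseen ratings get 1/7 each instead of zero.
```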
So there's still more weight on the ratings I did see, but the other possibilities, the ones I didn't see, aren't exactly zero; they're just a smaller value. And that's exactly what these probabilities are. A few properties of smoothing. One: as the smoothing goes to zero, we get back the original maximum likelihood. If the pseudocounts are zero, then supervised learning gives you the original maximum likelihood estimates. And if you smooth really hard, you get close to a uniform distribution: if I put 20 as my pseudocounts, then my distribution is pretty much uniform. There's still a little more weight on 1 and 4, the ones I saw, but it's pretty uniform. But no matter what the pseudocounts are, if you have more and more data (say you get a thousand examples of a 4), then eventually that probability will get closer and closer to one and all the others will get closer to zero. Generally, when you use Laplace smoothing, a value of 1 or even 0.1 is about right; you wouldn't use 20. Yeah, I don't think you would use 20. So, in summary: Laplace smoothing means you add the pseudocounts to all the counts, and then you count and normalize. You can think of it almost like an initialization, and it prevents zeros, that is, overfitting, in your probability estimates. All right. Let's do expectation maximization. This is one of my favorite algorithms, because it is a

Segment 14 (65:00 - 70:00)

bit magical. So far we've assumed that the training data consists of full assignments. But what happens if some of the variables are unobserved? Remember the fully observed setting. Take this as the example: I have G, R1, and R2, and my training data would be full assignments of (G, R1, R2). I have two examples: (drama, 1, 1) and (comedy, 2, 2). The network structure is G, R1 given G, R2 given G, and bam, I get my estimate. This makes sense: I have one drama and one comedy, so those probabilities are split. For R1, the probability of rating 1 given drama is 1, the probability of rating 2 given comedy is 1, and so on and so forth. But in the partially observed setting, I still have the same model, but now my data is different: I don't have a G anymore. So now what should we do? I can't count, because if I don't know the value of G, I can't count G, and everything depends on G, so I can't even estimate anything. That's not good. But this is the point of having principles: you can always go back to your principle and ask, what would maximum likelihood do? And maximum likelihood says, obviously, maximize the likelihood of the observed data, where the data is whatever is observed. So let's try to maximize the likelihood of this. The probability of an observation is P(R1 = r1, R2 = r2), and because I have a Bayesian network, I know this is actually a summation over g of P(G = g) P(R1 = r1 | G = g) P(R2 = r2 | G = g). The cool thing is that I don't see G, but I can just marginalize it out and get the probability of R1 and R2. So what we have going on here: we have the observed variables R1 and R2, we have the unobserved variable G, and we have the parameters.
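The marginalization just described, summing the hidden genre out of the joint, might look like this (a sketch with made-up parameter values; R1 and R2 share P(R | G) as in the lecture's model):

```python
# Hypothetical parameters for the model G -> R1, G -> R2.
p_g = {"drama": 0.4, "comedy": 0.6}
p_r = {"drama": {1: 0.6, 2: 0.4}, "comedy": {1: 0.4, 2: 0.6}}

def marginal(r1, r2):
    """P(R1=r1, R2=r2) = sum over g of P(G=g) P(r1|g) P(r2|g)."""
    return sum(p_g[g] * p_r[g][r1] * p_r[g][r2] for g in p_g)

# The likelihood of the partially observed data is a product of marginals.
likelihood = marginal(1, 1) * marginal(2, 2)
```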
And the challenging thing is that we only see R1 and R2; we don't know what G and theta are. What EM is going to do is maximize this likelihood. I'm not going to formally show this, so you have to trust me on that. It's sort of a chicken-and-egg problem. If I knew the parameters theta, so I know the probabilities, then I could compute the distribution over the unobserved variable G given the observed variables and the parameters. And if I knew the unobserved variable G, then I'd be back in the fully observable case and could just count and normalize to estimate the parameters. So, as with any chicken-and-egg problem, what you typically do is bootstrap: you initialize the parameters randomly and then you iterate. The EM algorithm consists of two steps. The E-step, where you guess what the unobserved variables are. And by guessing, I mean, more formally, computing the distribution over all possible values that the hidden variable could take on, P(G given the observations), and that gives me a set of weighted full assignments. I'll show you an example, so hopefully that'll make more sense. So the output of the E-step is essentially stuff that looks like fully observed data. Then in the M-step, I just count and normalize using these weighted full assignments. Okay, let's walk through an example here. This expectation maximization code is not a general function; it applies to just this simple model here. But hopefully it'll be illustrative.

Segment 15 (70:00 - 75:00)

Okay, so I first initialize the parameters: the probability of G and the probability of R given G. Here I'm doing parameter sharing, so R1 and R2 share the same conditional distribution. Now I'm going to iterate. This is my current guess, and in the E-step I go through all my data. My first data point is (1, 1); the second is (2, 2). I'm trying to guess what the value of G is. Let me write this out a bit more. I have G, R1, R2. I have (1, 1) here and (2, 2) here, but I don't know the value of G. So what I'm going to do is guess both drama and comedy: I essentially replicate each data point once for every possible value of g, and I give each copy a weight, which is my current guess of how likely that value is. In particular, the weight here is P(G = drama | R1 = 1, R2 = 1), and this one is P(G = comedy | R1 = 1, R2 = 1); this is P(G = drama | R1 = 2, R2 = 2), and this is P(G = comedy | R1 = 2, R2 = 2). So I'm using my model to assess how likely drama and comedy are for every data point where I don't see the value of G. Now I have to compute these values. Each is simply the joint distribution normalized by the marginal, the probability of the evidence. And to compute the joint, I have the probability of G times the probability of R1 given G times the probability of R2 given G. That gives me a number for drama, and for comedy I get some other number, all based on these current parameters. And then I normalize.
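The posterior computation in the E-step, together with the weighted count-and-normalize M-step described next, can be sketched as a complete loop for this toy model. This is my own sketch: the 0.4/0.6 initialization of P(G) follows the lecture, but the initial values of P(R | G) are my assumption (anything non-uniform works):

```python
# Toy EM for the model G -> R1, G -> R2 with parameter sharing on P(R|G).
data = [(1, 1), (2, 2)]                      # observed (r1, r2) pairs; G is hidden
genres, ratings = ["drama", "comedy"], [1, 2]

p_g = {"drama": 0.4, "comedy": 0.6}          # non-uniform to break symmetry
p_r = {"drama": {1: 0.6, 2: 0.4},            # initial P(R|G): my own choice
       "comedy": {1: 0.4, 2: 0.6}}

for _ in range(50):
    # E-step: for each example, compute the posterior over the hidden genre.
    weighted = []                            # (g, r1, r2, weight) tuples
    for r1, r2 in data:
        joint = {g: p_g[g] * p_r[g][r1] * p_r[g][r2] for g in genres}
        z = sum(joint.values())              # P(r1, r2): the evidence
        weighted += [(g, r1, r2, joint[g] / z) for g in genres]

    # M-step: count with weights instead of ones, then normalize.
    cg = {g: 0.0 for g in genres}
    cr = {g: {r: 0.0 for r in ratings} for g in genres}
    for g, r1, r2, w in weighted:
        cg[g] += w
        cr[g][r1] += w                       # R1 and R2 share parameters,
        cr[g][r2] += w                       # so both ratings feed one table
    p_g = {g: cg[g] / sum(cg.values()) for g in genres}
    p_r = {g: {r: cr[g][r] / sum(cr[g].values()) for r in ratings}
           for g in genres}
```

With this initialization the loop settles on dramas emitting rating 1 and comedies emitting rating 2; swapping the labels would give the same likelihood, which is the permutation symmetry mentioned later.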
So basically, based on these parameters, my guess for the first data point (1, 1) is 0.69 and 0.3. Let me label these drama and comedy for consistency. Notice that for this first example, the probability of comedy is 0.3 and the probability of drama is 0.69. Then what I'm doing is collecting this weighted dataset, which is what I have on the board: for this data point where I didn't see G, I'm just guessing each value of G and associating it with its weight. Now I go through the second example, and similarly I compute a distribution and normalize, and this one puts higher weight on comedy than drama, and I add two additional examples here. So I have (drama, 1, 1), (comedy, 1, 1), (drama, 2, 2), (comedy, 2, 2), right? And the

Segment 16 (75:00 - 80:00)

weights are 0.69, 0.3, 0.3, 0.69. So let me fill this out: this is 0.69 and this is 0.3. Oops, sorry if you can't see that. Whatever is on the board is just this picture here. Okay, so now I have this weighted dataset. Now I do the M-step, and the M-step should be familiar now. The only thing we change is that instead of adding one to a count, we add the weight. Before, every data point counted once; now every data point counts according to its weight. Then I normalize, and now I have an updated set of probabilities: instead of 0.4 and 0.6, I now have roughly 0.3 and 0.69. And if I iterate this, you'll see the probabilities keep changing, and by iteration four they get to something pretty close to one and zero. Okay, so a few notes. One is that the model doesn't see the hidden variables, so it's trying to set the hidden variables in a way that drives up the probability of the data. In this example, I see (1, 1) and I see (2, 2), so it's going to settle on the assumption that dramas always give a 1 and comedies give a 2, which is actually the same as the supervised data that we had. Here's another example: a bunch of (1, 1)s and (2, 2)s, plus some (2, 1) and (1, 2). In this case EM gives me these probabilities: slightly more probability on drama and slightly less on comedy, with the probability of rating 1 given drama at 0.9 and rating 2 given comedy at 0.75. These are not quite as sharp as one and zero, because the assignments are now more heterogeneous; it's not all 1s and all 2s. Some notes about EM: we're guaranteed to increase the likelihood each iteration and to converge to a local maximum.
This is not a global maximum, so EM can get stuck in a local optimum, and you need to initialize with non-uniform distributions to break symmetries. That's why I put 0.4 and 0.6 instead of 0.5 and 0.5. The other caveat is that you can only recover the hidden variable up to a permutation of the labels, which means that in this case, if you switch the labels drama and comedy, you get the same likelihood of the data, because these are just labels. So this is fine for clustering, where the labels don't necessarily have semantic meaning. Okay, so wrapping up: the learning task we've talked about is, given training data, estimate the parameters of a Bayesian network. The parameters are the local conditional distributions, but bear in mind that with parameter sharing there can be fewer local conditional distributions than variables in your Bayesian network. If you're worried about overfitting, you should use Laplace smoothing, which is basically adding pseudocounts. And finally, we talked about expectation maximization, which allows us to handle partially observed data. Hopefully the intuition is clear: if I don't see a variable, I try to guess what that variable is and weight each guess by the probability of that value given the observed variables; then I have fully labeled, weighted data, I count and normalize like I do in the supervised setting, and I iterate. Okay, that's all I'm going to say for

Segment 17 (80:00 - 80:00)

Bayesian networks. Next time we're going to do something quite different: we're going to start talking about logic, which is going to enable higher-order reasoning. Bayesian modeling is about a certain type of probabilistic reasoning, but higher-order reasoning will allow us to use quantifiers, like "every student has a TA". The word "every" is something that requires a bit more sophistication.
