Stanford CS221 | Autumn 2025 | Lecture 12: Bayesian Networks I



Contents (16 segments)

Segment 1 (00:00 - 05:00)

So, welcome back everyone. Today we're going to switch gears a bit and talk about Bayesian networks. To motivate this shift, let's review what we've done in this class so far. When we started, we talked about the ingredients for intelligence: an agent has to perceive the world, it has to reason with that knowledge, it has to act, and it has to learn. Where we started was just with machine learning, where we take percepts and return actions. We train these predictors; they could be linear or they could be deep neural networks. But the basic idea was that you take whatever you sense in the environment and produce a prediction or an action. Then over the last few weeks, we looked at state-based models for reasoning. The idea is that agents, in order to be intelligent, can't just reactively predict based on the environment; they have to do some thinking or planning. Along these lines, we looked at search problems. These are models where the actions have deterministic consequences, so the agent can think about finding the minimum cost path. Then we looked at Markov decision processes, where the actions have stochastic outcomes but the distribution over those outcomes is known, and you can do things like value iteration. And then we also looked at games, where we know how the world works except there's an adversary who's out to get us, or more generally an unknown opponent strategy, and we want to be conservative, so we had to do things like minimax. Okay. So in each of these cases we looked at both model-free and model-based approaches. Model-free essentially says we just need to know enough to know what actions to take and how much utility they will produce. Examples include classification, regression, SARSA, Q-learning, TD learning. These are all approaches that try to directly get at the actions given the state or the environment.
We briefly looked at model-based methods as well. The idea of model-based methods is that we're not content just to be able to predict the right action; we actually want to understand how the world works. When we did search or value iteration or minimax, these were model-based in the sense that we have a model of the world: we understand that if we take this action, we know what will happen, and that allows us to plan. We take this action, this will happen, then we take that, and so on. So in some sense it's a more powerful way of doing things, but it is harder, because just knowing what action to take seems like a simpler problem than understanding how the world works. So model-free methods are direct and cheaper, which is why in a lot of practical applications people just go to them; they're faster. But model-based methods are more flexible. For example, if you have a model of how the world works, you can change the reward function on the fly, keep the transitions the same, and recompute the optimal policy. Whereas a Q-value compiles down all the information about the rewards and transitions, so the Q-value doesn't know: am I taking this action because it's more likely to lead to something good, or because the expected outcome is better? Having a model gives you more power. So then there's a question: if we buy into this idea of modeling the world, how should we represent the state of the world? That seems like a very daunting thing, and that's going to be the topic of Bayesian networks. Bayesian networks are about how you represent the world and how you reason about the world in the face of uncertainty. Okay. So it's going to be a pretty different shift in the way we think about problems, and hopefully it will become clear as we go along.
So to introduce Bayesian networks, we first need to introduce the rules of probability. Now, all of you should have taken probability, so this hopefully

Segment 2 (05:00 - 10:00)

should be a review, but I will do it in a slightly unorthodox way which will make it more interesting. Okay. So we want to represent the state of the world by a set of variables, and these variables, also called random variables, are going to represent attributes of the world. As a very simple example, you might have two variables: S, which represents whether the sun is shining or not (one means it is and zero means it's not), and another variable R, which is whether it's raining or not. Given these two variables, we can now characterize what state the world could be in, and these are called assignments to the random variables. So S=0, R=0 means it's not sunny and it's not raining. S=0, R=1: it's not sunny, but it is raining. S=1, R=0: it's sunny, but it's not raining. And S=1, R=1: it's both sunny and rainy. Okay, so given two variables, you just look at all the possible values that they could take on, and those are your possible assignments. The next step is that we want to associate each assignment with a probability. We think about a joint distribution over a set of random variables as providing a probability for each assignment. So maybe the probability of S=0, R=0 (no sun, no rain) is 0.2, the probability of S=0, R=1 is 0.08, the probability of S=1, R=0 is 0.7, and the probability of it being sunny and rainy is 0.02, which makes sense because it usually isn't sunny and rainy at the same time. And because we're in California, the probability of sunny and not rainy is pretty high. And I guess if it's not sunny and it's not rainy, it's probably nighttime or something. Okay. So what I'm going to do now is represent this in code. I have this class called ProbTable, and it is basically a wrapper around a numpy array.
This numpy array is a matrix, and you can think of the first axis of the matrix, the rows, as the values of S, and the columns as the values of R. So just to be very clear about this: you have S and you have R, each taking values 0 and 1, and the entries are 0.2, 0.08, 0.7, 0.02. So this matrix, or tensor as we'll see later, represents the joint probabilities. Now, what I've also done is rolled this out into a table, because once we have four variables it's going to be really hard to draw four-dimensional tensors. This is an equivalent representation where for every variable assignment I have a probability. So the probability of sun and rain is 0.02. All right. So this is a joint distribution over all the variables: every assignment gets a number, which is a probability. And now the first operation is called marginalization. Marginalization intuitively just ignores a variable, and what I mean by ignoring the variable is that it's going to collapse all assignments that only differ in R. So if you have
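The ProbTable class itself isn't shown in the transcript, so here is a minimal sketch of the underlying idea, a joint table as a plain numpy array (the variable name and layout are my assumptions):

```python
import numpy as np

# Joint distribution P(S, R): rows indexed by S, columns by R.
# Entry [s, r] is the probability of the assignment S=s, R=r.
P_SR = np.array([
    [0.20, 0.08],   # S=0: not sunny & not rainy, not sunny & rainy
    [0.70, 0.02],   # S=1: sunny & not rainy,     sunny & rainy
])

# A joint distribution must sum to 1 over all assignments.
assert np.isclose(P_SR.sum(), 1.0)

# "Rolled out" view: one (assignment, probability) row per entry.
for s in (0, 1):
    for r in (0, 1):
        print(f"S={s}, R={r}: {P_SR[s, r]:.2f}")
```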

Segment 3 (10:00 - 15:00)

the probability of S=0, that's P(S=0, R=0) plus P(S=0, R=1). We don't care what value R takes, because R is being marginalized out. So we collapse these two assignments into one and add the probabilities, and that gives us 0.28. And for the probability of S=1, you collapse the two assignments which differ in R but don't differ in S. Pictorially, this corresponds to combining the first two rows, which gives the probability of S=0, and the second two rows, which gives the probability of S=1. And if you marginalize multiple variables, then you're collapsing potentially many more rows. Okay. So the output of marginalization is a table that is smaller. And here comes the cool part. You thought you were done with einops, but it's coming back now. Probability tables are just tensors, right? This thing on the board is a matrix, but it's also a tensor, and therefore we can actually use einops to express the computation. And it turns out all the laws of probability are essentially covered by einops, so hopefully this will give you a deeper appreciation of einops. Just to write out formally what the marginal probability is: P(S=s) is the summation over r, the variable you're marginalizing out, of the joint. This holds for all values of s, and you're summing over all values of r. If you remember einops, when you see something like this, you should be able to write a one-liner that represents that computation. And what I'm doing here is: p_SR is this table, the joint distribution. I write the .p because p_SR is the probability table object and .p gives you the underlying tensor, and that's what einsum takes. It's a bit of a notational nuisance, but don't worry about it too much. Okay. And then, remember, the pattern string labels the axes of this tensor: the first axis is S, the second one is R.
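The marginalization one-liner can be sketched like this (the lecture uses the einops library; np.einsum is equivalent here for single-letter axis names):

```python
import numpy as np

# Joint distribution P(S, R) from before.
P_SR = np.array([
    [0.20, 0.08],
    [0.70, 0.02],
])

# Marginalization: P(S=s) = sum_r P(S=s, R=r).
# "sr->s" sums over every axis (here r) that doesn't appear on the right.
P_S = np.einsum("sr->s", P_SR)
print(P_S)   # [0.28 0.72]
```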
And then the output tensor has the label S. So the way to interpret this is: for all values of S, I sum over R, or in general over the variables that don't show up on the right-hand side. I know einops was a bit confusing at first, so maybe I'll just pause and make sure that people understand marginal probabilities. All right. So the other major operation in probability is conditioning. Intuitively, the way to think about conditioning is that when you condition on a partial assignment like R=1 (it's raining), you're only selecting the assignments that satisfy that condition. Okay. So the way we're going to do this is: look, these are the possible assignments; R=1 means I'm only going to pay attention to the rows where R=1. There are two such rows: the one where S is zero and R is one, and the one where S is one and R is one. The other two rows just don't matter. In general, I'm always going to be conditioning on a particular value here, which means I'm selecting a set of rows. And then I compute the probability of the conditioning event, which I'm going to call the evidence. So I observe evidence: I look outside and I see it's raining. Okay. And now I want to condition on that evidence to make other inferences. So I have to compute the probability of this evidence, which is essentially the sum of the probabilities

Segment 4 (15:00 - 20:00)

of all the assignments that are left after the selection process. So here I have the assignments (0,1) and (1,1), which have total probability 0.1. Okay. And then finally I divide by this probability. So P(S=0 | R=1) is the selected joint probability divided by this probability of evidence, and same for S=1. You can think about this as a renormalization. First, select based on the evidence, so I look at only the assignments that matter. But these are really low probabilities, because the probability of rain itself is small. So when you condition, you're saying: given that it rained. That means you're dividing by the probability of the evidence, so even if that probability is small, you're blowing up the selected probabilities so that they sum to one. Okay. So you can also use einops to compute this conditional distribution. There are just a few more steps, but hopefully this math makes it clear. The first thing we're going to do is the selection. I'm going to introduce this tensor R1. This is basically saying: if I put in a value of R, it's going to tell me whether the constraint is satisfied or not. If I put in zero, I get a zero, because R=0 makes the constraint false. And if I put in one, I get one, or true, which means the constraint is satisfied. Okay. So then what I'm going to do is take my joint distribution P(S,R), this thing over here, and essentially multiply it by R1. And that's going to zero out a bunch of things. Okay. So now I have two tensors here, and just to keep track of the axes: the first tensor is labeled with S and R, and the second one is labeled with R. And for the output I want S. I could put R here, but R is always going to be one, because that's what I selected for. Okay. So that gives me this quantity, which is what I had before: the selection of the joint with the evidence R=1, where I simply drop the R.
So R doesn't show up on the right-hand side. Okay. And then it's basically the same thing as before: we compute the probability of the evidence, and the way you do that in einops is you take this probability table, which is indexed by S, and give it no output axes, which means it just sums over all the elements, and that gives you 0.1. And then finally you divide. So when you divide P(S=s, R=1) by P(R=1), you get P(S=s | R=1), and you can see that gives you 0.8 and 0.2. There was a question about whether you represent this last step with an einsum; right now this is just a scalar division, so you don't really need anything fancy. Another question was whether this works for discrete or continuous variables. Everything we do in this class will have discrete random variables, and there you can use einsum. If you have continuous variables, then when you do marginalization you have to do integrals, and conditioning with continuous variables becomes a little dicey, because you can condition on what is basically a probability-zero event, so you have to be careful there. Okay. So let me summarize. I think many of you have experienced probability and done probability before, but I think it is important to have the right kind of conceptual intuitions for what all the probabilities represent. The joint distribution: we can think about this as the source of truth. Remember, we're modeling states of a world. So for every
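The select-then-renormalize recipe can be sketched end to end (again with np.einsum standing in for einops; the names R1 etc. follow the lecture's description):

```python
import numpy as np

P_SR = np.array([
    [0.20, 0.08],
    [0.70, 0.02],
])

# Selection tensor for the evidence R=1: R1[r] is 1 iff r satisfies R == 1.
R1 = np.array([0.0, 1.0])

# Select: P(S=s, R=1). R is dropped from the output since it's fixed to 1.
P_S_and_R1 = np.einsum("sr,r->s", P_SR, R1)      # [0.08 0.02]

# Probability of the evidence: empty output sums over all remaining axes.
P_R1 = np.einsum("s->", P_S_and_R1)              # 0.1

# Renormalize to get the conditional distribution P(S | R=1).
P_S_given_R1 = P_S_and_R1 / P_R1
print(P_S_given_R1)                              # [0.8 0.2]
```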

Segment 5 (20:00 - 25:00)

assignment, which is a state of the world, I have the probability that that state occurs. And everything derives from the joint distribution. For marginalization, the intuition is that you're collapsing assignments that only differ in the marginalized variables; you're ignoring differences in some variables. And conditioning you should think about as: you observe something in the world, like it's raining, and you want to condition on that, which means you select by the evidence, compute the probability of the evidence, and divide, and that gives you the renormalized probability of whatever you're interested in, conditioned on that event. And finally, the probability tables are tensors, so we'll be using einsum to express these computations. Okay. So now I've introduced basic probability; let me talk a little bit about probabilistic inference. Here's another example. Suppose you have four random variables: sunshine, rain, traffic, and whether it's autumn or not. Now the joint distribution would look like P(S, R, T, A), and what do you do with this joint distribution? What you do with it is ask questions of it. For example, you might be interested in knowing: what is the probability that it's going to rain, given that there's going to be traffic and also it's autumn? The task of probabilistic inference is to answer questions such as these based on the joint distribution. An analogy you should have in your head is that the joint distribution is like a SQL database, and probabilistic inference is essentially doing SQL queries on that database. There's actually a much more formal relationship here which I'm not going to get into, but I think that's the mental model: you have your database with all the facts, and you just ask questions, like what is the tallest building in Europe or something like that. Okay.
So, what is the type of question we can answer? They're all going to look like this. They're always going to have some evidence that we're conditioning on, and the evidence selects a subset of the variables, such as T and A, and assigns them particular values: in this case, traffic and autumn. And given the evidence, we're interested in a query, rain for example. So in probability notation, what we're asking for is the probability of rain given the evidence T=1, A=1. And notice that not all the variables are mentioned. The variables that are not in either the evidence or the query are assumed to be marginalized out. For example, S here, sunshine, is not mentioned, so remember: we collapse assignments that only differ in the value of S. Okay. So now let me go through another example. Now we're going to start introducing Bayesian networks. So far there's been nothing Bayesian-network about it, because Bayesian networks are built on top of probability, so we had to understand probability first. And the whole point of Bayesian networks is that they make it easier to define joint distributions; that's where we're going. I'm going to start with an example, and let's try to solve it using a Bayesian network. Later I'm going to define formally what a Bayesian network is, but by that time hopefully you'll already understand it through examples. This is a problem of earthquakes, burglaries, and alarms. Okay. So we're going to assume that earthquakes and burglaries are independent events and each happens with probability 0.05, so a 5% chance, which is actually kind of high, but let's go with that. You've installed an alarm in your house, and the alarm will go off if there's an earthquake or a burglary.

Segment 6 (25:00 - 30:00)

Okay? And suppose that you're away on a trip somewhere, but of course your alarm is hooked up to your phone, and you see: oh, there's an alarm. And suppose this alarm doesn't tell you why it's going off; it just tells you that there's an alarm. Okay. So at this point you know something's wrong, but you don't know what it is. So now here's the question. Suppose you then pull up the news and you see that there was actually an earthquake near where your house was. How does hearing the news change the probability of the burglary? Okay, so I'm going to take a poll on this. There are three options: the probability of the burglary, given that there's an earthquake and an alarm, goes down, goes up, or stays the same. This is what you might be thinking: well, intuitively it seems like it should go down. But then, hm, they're supposed to be independent. So if they're independent, why would it go down? What do burglaries and earthquakes have to do with each other? Nothing, right? Okay. So there's a paradox here which we're going to resolve, so just hang tight. The way we're going to resolve this is to define a Bayesian network, and then everything will become crystal clear. What we're going to do is define a joint distribution over earthquake, burglary, and alarm. And then what we want to know is which is larger, or whether they're the same: the probability of a burglary given the alarm only, or the probability of a burglary given the alarm and an earthquake. Okay, so now I've formally specified what I'm interested in in terms of probabilistic inference. Then we're going to solve this problem using Bayesian networks. We're going to do two things: construct a joint distribution, and then once we have it, do probabilistic inference. Okay. All right.
So let's first construct the joint distribution. We could just write down this huge table directly, right? I mean, it's not that big: there are eight possible assignments to three variables, 2 to the 3, so we could just write them all down. But it's going to be much nicer to do this in a more intuitive way, and the more intuitive way has four steps. First, I'm going to define the variables. Then I'm going to connect the variables with directed edges. Then I'm going to write down a local conditional probability for each variable. And then finally multiply everything together to get the joint distribution. So I'm going to do this in a more modular way, and I'm going to go through the four steps; that's the road map. Okay. So the first step is to define the variables. What are the variables? Well, there's B: is there a burglary? E: is there an earthquake? And A: is the alarm going off? Done. Usually that's the easy part. Step two: we're going to connect the variables with directed edges, which suggests direct influence. And here I want to be a bit careful, because the natural thing would be to say "causes," but causality is a pretty tricky topic, so technically Bayesian networks do not automatically imply causality. But if you want, you can think of an edge as a cause if it helps your intuition; just know that it's not completely formal. Okay. So here are my three variables. Each variable is a node, and the idea is that there are burglaries and earthquakes, and then the alarm depends on burglary or earthquake: the alarm goes off if B or E. That's why alarm is derived from B and E. Okay. So now we have this probability. Let me just do it on the board here, so we have it as a reference. We have the nodes

Segment 7 (30:00 - 35:00)

B, E, and A, and we have these edges here. So now, for each node, each variable, I need to specify how it behaves, and here I'm going to write down a local conditional distribution of a node given its parents. So here I have the probability of a burglary and the probability of an earthquake, and then here I have the probability of the alarm given burglary and earthquake. Okay. And for each of these, I need to specify the probability table. I'm going to assume, based on the problem statement, that the probability of a rare event is epsilon, which is 0.05. So the first local distribution is p(B=b): the probability of no burglary is 0.95, and the probability of a burglary is 0.05. Okay, the same for earthquake: the probability of no earthquake is 0.95, the probability of an earthquake is 0.05. Okay, so you can think of each of these tables as sitting on a node. And now, to explain this a bit: here I'm giving you the probability of A given B and E, and in code you can use a trick which gives you a compact way of representing this table. Instead of specifying every single combination, you can write a function that computes every single value. The function here is called for every row; it takes the value of B, the value of E, and the value of A, and returns the probability. What this is doing is checking whether A is equal to (B or E). If it's equal, then the probability is one, and if it's not equal, the probability is zero. So for example, if B is 1, E is 0, A is 0: 0 is not equal to (1 or 0), so that's a zero. Whereas 1 is equal to (1 or 0), so that's a one. That's how all of these probabilities are filled out given this function. Okay. So now I have these three local conditional distributions, one for each of these variables, and that's the end of step three. Step four is defining the joint distribution as a product of the local conditional distributions.
And here this means that for every possible assignment to B, E, and A, the joint distribution is defined to be the product of the local conditional distributions. Again, whenever you see something like this, where there's a bunch of for-alls and a product, remember each of these factors is going to be represented as a tensor, and to multiply the tensors together you can use einsum. Let's just see what's happening. I have p(B), p(E), and p(A | B, E), and I'm multiplying all these things together; that's what einsum does. And the pattern string tells me how all the indices are going to behave: p(B) is a tensor indexed by B, p(E) corresponds to indexing E, and the last one has B, E, and A. And the output is B, E, and A. Okay. So I'll just show you what the result looks like: the probability of B, E, and A. I mean, the numbers aren't too important, but this is basically the product of these three probability tables. And just to sanity check, the probability of nothing happening is about 0.9. So that's reasonable; nothing usually happens. And the probability that there's a burglary and an earthquake and your alarm goes off is pretty low. Okay, so let me pause here. These are the four steps, and we've defined the
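The four steps can be sketched in code under the transcript's assumptions (ε = 0.05; the alarm is a deterministic OR of burglary and earthquake); the variable names here are mine, not the lecture's ProbTable API:

```python
import numpy as np

eps = 0.05

# Step 3: local conditional distributions as tensors.
p_B = np.array([1 - eps, eps])            # p(b)
p_E = np.array([1 - eps, eps])            # p(e)

# p(a | b, e): probability 1 if a == (b or e), else 0.
p_A_given_BE = np.zeros((2, 2, 2))        # axes: b, e, a
for b in (0, 1):
    for e in (0, 1):
        for a in (0, 1):
            p_A_given_BE[b, e, a] = 1.0 if a == (b or e) else 0.0

# Step 4: joint P(b, e, a) = p(b) * p(e) * p(a | b, e).
P_BEA = np.einsum("b,e,bea->bea", p_B, p_E, p_A_given_BE)

print(P_BEA[0, 0, 0])   # P(nothing happens) ≈ 0.9025
```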

Segment 8 (35:00 - 40:00)

joint distribution over all random variables, and that characterizes the problem setup precisely. The next part: once I have this joint distribution, this is our database, right? Now I'm going to do the equivalent of SQL queries, and that's probabilistic inference. So now we can answer any queries we want. I'm going to do a few, and we're going to look at what the results tell us. First let's just warm up with something very simple: let's compute the probability of a burglary. Okay. Every one of these queries is going to begin with this joint distribution, that's our database, and do a bunch of marginalization and conditioning as needed. This one is fairly straightforward: I have the joint distribution, and I erase E and A, which means I'm marginalizing them out, and that gives me the probability of B. Okay. And you'll notice that this is exactly the same as the local conditional distribution of B that I had earlier, which is good. I want to take this opportunity to point out that I'm explicitly differentiating between these local conditional distributions, for which I'm using a lowercase p, and the marginal distributions, for which I use an uppercase P. The way to think about this is that local conditional distributions are just defined.
I just defined the local distribution; I just wrote it down, like up here. And these marginal distributions are mathematically derived, by the laws of probability, from the joint. So anytime you see a big P, that means it's derived from the joint. Okay, so these are mathematically two different objects. It's maybe not obvious at first sight that they should be the same, but they happen to be the same in cases like this, which is why sometimes people just mix up the two. But if you want to understand rigorously where things are coming from: there are the local distributions, which define the probabilities, and then the marginals, which are derived. Okay, so the probability of a burglary in this case is 0.05. It's pretty low; that's as advertised. And now let's do burglary given alarm only. Okay. So how do I do conditioning? We've done this before: we first select based on the evidence. The evidence says A is equal to one, so I'm going to multiply A1, this selection tensor, into the joint distribution. And I'm also going to simultaneously erase E and A, because I marginalize them out; I'm only interested in B. Okay, so you can kind of see why einsum is so powerful, right? Because I'm doing two things here: I'm both selecting and marginalizing at the same time. Okay, so this gives me a tensor: the probability of a burglary and an alarm, ignoring earthquake. Notice that the joint probabilities are always small values, because both burglaries and alarms are pretty low probability. Now I need to convert this into a conditional. So what do I do?
I compute the probability of the evidence, which means adding these two entries up, and then I divide by that probability of the evidence, and that gives me the conditional. Okay, so I'm going to take this tensor and divide by 0.0975. And usually when you divide by the evidence, that scales up all the probabilities, because they have to normalize to one. Notice that the selected values don't sum to one; it's only when you divide by the probability of the evidence that you get a distribution. And just to remind you: when you see a conditional probability, it's summing over the left-hand side of the conditioning bar that gives you one; you're not summing over the right-hand side.
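That query can be sketched with np.einsum, building on the joint from the four-step construction (same assumed setup and variable names as before):

```python
import numpy as np

eps = 0.05
p_B = np.array([1 - eps, eps])
p_E = np.array([1 - eps, eps])
p_A_given_BE = np.array([[[float(a == (b or e)) for a in (0, 1)]
                          for e in (0, 1)]
                         for b in (0, 1)])
P_BEA = np.einsum("b,e,bea->bea", p_B, p_E, p_A_given_BE)

# Evidence A=1: select with A1 and marginalize out E and A in one shot.
A1 = np.array([0.0, 1.0])
P_B_and_A1 = np.einsum("bea,a->b", P_BEA, A1)    # [0.0475 0.05]

P_A1 = P_B_and_A1.sum()                          # P(A=1) = 0.0975
P_B_given_A1 = P_B_and_A1 / P_A1
print(P_B_given_A1[1])                           # ≈ 0.513
```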

Segment 9 (40:00 - 45:00)

Okay. So the probability that there's a burglary, given that the alarm went off, is a little over 50%. This sort of makes sense: you see the alarm, so you're thinking, well, it could be a burglary or an earthquake, or maybe both, so it should be pretty high. You should be really worried at this point. Okay, so burglary is much more likely. So now we come to the final question; everyone's in their seats waiting for the answer here. What's the probability of a burglary given that there's an alarm and an earthquake? Okay, so here I have two pieces of evidence, so I'm going to define two selection tensors, A1 and E1. Then I'm going to again do this mega-operation where I take the joint distribution P(B, E, A), take the selection tensors A1 and E1, and multiply them in to zero out the rows where the evidence is not true, and also drop E and A. That gives me P(b, A=1, E=1). Okay, so let's see what that gives us: it gives us this. And then the rest is the same as before: compute the probability of the evidence, which means adding these two numbers up, and then divide by the probability of the evidence to give us the probability of burglary given alarm and earthquake. And you notice that now the probability of the burglary has indeed decreased, to 0.05. Okay. So the people who followed their intuition were correct. And now the question is: wait, wasn't there this thing about independence here? Why would knowing about earthquakes affect burglaries? This is the intuitive reason: think about why the alarm sounded. It must have been because of an earthquake or a burglary. And when you observe the earthquake, that explains away the alarm, meaning: oh, it's probably the earthquake that's generating the alarm, not the burglary.
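The two queries can be compared side by side to see the explaining-away effect numerically (same assumed setup as the earlier sketches):

```python
import numpy as np

eps = 0.05
p_B = np.array([1 - eps, eps])
p_E = np.array([1 - eps, eps])
p_A_given_BE = np.array([[[float(a == (b or e)) for a in (0, 1)]
                          for e in (0, 1)]
                         for b in (0, 1)])
P_BEA = np.einsum("b,e,bea->bea", p_B, p_E, p_A_given_BE)

A1 = np.array([0.0, 1.0])
E1 = np.array([0.0, 1.0])

# P(B=1 | A=1): condition on the alarm only.
num = np.einsum("bea,a->b", P_BEA, A1)
p_b_given_a = num[1] / num.sum()                 # ≈ 0.513

# P(B=1 | A=1, E=1): also condition on the earthquake.
num2 = np.einsum("bea,a,e->b", P_BEA, A1, E1)
p_b_given_ae = num2[1] / num2.sum()              # 0.05

# Explaining away: the extra evidence lowers the burglary probability.
assert p_b_given_ae < p_b_given_a
```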
So it sort of takes the pressure off the burglary and puts it on the earthquake; that's the intuitive reason why it holds. And well, this is just math, so it's just true. But in case you're still wondering what happens to the independence: the key thing here is that B and E are independent if you don't observe A. But when you observe a variable that is downstream of variables that are independent, that makes the upstream variables not independent anymore. Okay. And the reason why I think Bayesian networks are powerful is that before people really nailed these things down probabilistically, they had all these heuristic ways of reasoning about uncertainty, and they were inconsistent and kind of broken; only when you bring mathematical clarity to this does it make sense. Okay. So this idea of explaining away is actually quite general and powerful, and it's sort of a defining feature of a Bayesian network. The general idea is that you have two causes, in this case a burglary and an earthquake, positively influencing an effect A. If you condition on the effect, and then you further condition on one of the causes, that reduces the probability of the other cause. In other words, E explains away the evidence, and you don't need the other cause to explain it. In symbols: the probability of B conditioned on both the other cause and the effect is less than the probability conditioned on just the effect.

Segment 10 (45:00 - 50:00)

And the important and cool thing is that this is true even if the causes are independent. Okay, just a quick note about notation; I'll emphasize this again. I'm using lowercase and uppercase deliberately here. Lowercase names represent particular values. So, for example, up there I had the probability that a random variable (uppercase) equals a particular value (lowercase). That's the convention. Secondly, I'm using lowercase p to denote the local conditional probabilities, which are given by definition, and uppercase P to denote the marginals and conditionals, which are derived from the joint distribution by the laws of probability. I already mentioned this. Furthermore, if I write P(E), that is the whole probability table, and if I write P(E = e) for a particular value, that is a probability, which is a number. All right. To summarize what we've done so far: we've defined a joint distribution via this four-step procedure. You define what the random variables are, define the edges between the random variables, define a local conditional probability for each node/variable, and then multiply them together to produce a joint distribution. This is an intuitive way of defining joint distributions, rather than slavishly going through and writing out every probability. We also did probabilistic inference, which basically says: select the assignment rows based on the evidence, compute the probability of the evidence, and then divide by that probability. That's mechanically what happens. And then we also saw that explaining away is this really cool phenomenon: a qualitative reasoning pattern that follows from the math. So let's move on and do another example. This is going to be similar to the alarm, but with four nodes instead of three.
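In formula form, step four of the recipe for the alarm network reads as follows, and the general pattern is the product over all nodes of their local conditionals:

```latex
P(B=b, E=e, A=a) \;=\; p(b)\, p(e)\, p(a \mid b, e),
\qquad
P(X_1=x_1, \ldots, X_n=x_n) \;=\; \prod_{i=1}^{n} p\big(x_i \mid x_{\mathrm{Parents}(i)}\big)
```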
But hopefully it will reinforce these ideas if you see them another time. So here the problem is, I guess a lot of people are sick these days: if you're coughing and you have itchy eyes, what is the probability that you have a cold? I should preface this by saying I don't really know the medical science; this is just an example, so don't take it as medical advice. Okay, so let's model this problem using a Bayesian network. Step one, define the random variables. So what's going on here? Potentially you have a cold, or you don't. You have allergies, or you don't. You're coughing, or you're not. And you have itchy eyes, or you don't. Of course, this is a simplification, because there are degrees to which you can have these, but again, this is just a toy example. Step two, connect the variables with directed edges. Again using my very sketchy medical prior knowledge, we say that cold and allergies are just independent, so there's nothing connecting them. These are the potential problems, and the symptoms are: you cough when you have either a cold or allergies, and you have itchy eyes only when you have allergies. Okay, so this is what it looks like. Now, step three, write down the local conditional probabilities for every variable. There are four variables here, and we're going to go through each variable and write it down. So we have the probability of C. For C, I write down, well, you know, colds are a 10%

Segment 11 (50:00 - 55:00)

chance: you have a cold with probability 0.1. Allergies: a 20% chance you have allergies. Here is something a little more complicated: what's the probability that you have a cough given your state of cold and allergies? This is a local conditional distribution where I'm conditioning on C and A, and I want to know the probability of H. What I've written down as this function is: if H matches (C or A), that happens with probability 0.9, and otherwise with probability 0.1. Before, I just had ones and zeros, but now, just to hedge our bets a little, we're going to make things a little smoother: 0.9 and 0.1. So, for example, just to be clear, this entry says that if you have a cold and no allergies, then the probability you have a cough is 0.9. And the fourth variable is I given A, the probability of itchy eyes given allergies. I'm just going to assume it's 0.9 if the two agree: the probability of itchy eyes given allergies is 0.9, and the probability of not is 0.1. So now that I've defined these four local conditional distributions, I can define the joint distribution by just multiplying them all together. And this gives us this pretty horrendous mess, but conceptually it's hopefully clear: you just take the local conditional distributions and multiply them all together, and you just have to make sure you put all the right indices in. For each of these four local conditional distributions, you have to write down the variables in the order they came, and, just as a convention, this is written H given C, A: I always write the conditioning variables first and the variable itself last. Okay. So that gives me this database table of probabilities, and you can spot-check things here.
So most of the time, about 0.6 of the time, nothing's happening, and the probability that you have a lot of things going wrong is much lower. Okay, so now we do probabilistic inference. We can ask any question we want. So, what is the probability of cold given cough? I'm going to go through these a little faster because it's the same kind of idea, but you can go through the lecture notes afterwards. First, you select based on the evidence and marginalize out the variables you're not interested in. We're only interested in cold and cough, right? So allergies and itchy eyes: just forget about them, just collapse them. Okay, so this gives us the probability of cold and cough = 1. Compute the probability of the evidence, which is just adding these up, and then divide by the probability of the evidence to get a normalized answer. So the answer is: the probability that you have a cold given that you have a cough is 0.28. Okay, so let's do another one. Let's compute the probability you have a cold given you have a cough and itchy eyes. So now let me ask you this question: this probability was 0.28. What do people think: if you add itchy eyes, does that increase, decrease, or keep the probability of the cold the same? Okay, so let's look here. This is a little bit tricky. I see a lot of tenuous hands, which is to be expected. That's why we have math, and we can just do it. Okay, so here, I'll go through this quickly. So, you select based on the

Segment 12 (55:00 - 60:00)

evidence. So, you multiply the joint probability by the selections h = 1 and i = 1, and that gives you the probability of cold with h = 1 and i = 1. Compute the probability of the evidence by summing these two, then divide, and you see that the probability of cold has gone down. Okay, so at this point you're thinking: what's going on here? So let me explain. Actually, let me ask you this question: why is the probability of cold lower if you add itchy eyes? The probability of cold given just cough is 0.28, and the probability of cold given cough and itchy eyes is 0.13, so it's roughly halved. Yeah, exactly. Basically, this is an example of explaining away again. And what was explaining away? You have an effect, and you have two causes. But itchy eyes is not a cause of this effect, right? There's no edge from I to H. But I is connected to A. So if you observe you have itchy eyes, that actually increases the probability of allergies, and increasing the probability of allergies now explains away the cough, which is why you're less likely to have a cold. Okay. So I think this is really neat: you just define the Bayesian network and all these reasoning patterns pop out. I don't think anyone would have had really good confidence to intuitively say how much observing itchy eyes would change the probability of a cold given a cough, but if you have the Bayesian network, it handles things for you. Okay, so to summarize: same recipe here. Once you define the joint distribution, you can do probabilistic inference, and we see that explaining away shows up again, now a little more complicated. And in general, you could imagine, say, a 100-variable Bayesian network where you're conditioning on, like, 17 evidence variables, and all the variables are interacting with each other; intuitively, you can think of them as propagating evidence.
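The two medical-diagnosis queries above can be reproduced with the same select-and-normalize recipe; here is a sketch assuming the local probabilities defined earlier (p(C=1) = 0.1, p(A=1) = 0.2, cough likely iff cold or allergies, itchy eyes likely iff allergies):

```python
import numpy as np

p_c = np.array([0.9, 0.1])   # P(C): cold prior
p_a = np.array([0.8, 0.2])   # P(A): allergies prior

def p_h1(c, a):
    # p(H=1 | c, a): cough with probability 0.9 iff cold or allergies
    return 0.9 if (c or a) else 0.1

def p_i1(a):
    # p(I=1 | a): itchy eyes with probability 0.9 iff allergies
    return 0.9 if a else 0.1

# Joint P(C, A, H, I) as a 2x2x2x2 tensor: product of the four locals
joint = np.zeros((2, 2, 2, 2))
for c in range(2):
    for a in range(2):
        for h in range(2):
            for i in range(2):
                ph = p_h1(c, a) if h == 1 else 1 - p_h1(c, a)
                pi = p_i1(a) if i == 1 else 1 - p_i1(a)
                joint[c, a, h, i] = p_c[c] * p_a[a] * ph * pi

# P(C | H=1): select h=1, sum out A and I, normalize
sel = joint[:, :, 1, :].sum(axis=(1, 2))
print(sel[1] / sel.sum())            # about 0.28

# P(C | H=1, I=1): itchy eyes boost allergies, which explain away the cough
sel2 = joint[:, :, 1, 1].sum(axis=1)
print(sel2[1] / sel2.sum())          # about 0.13
```

Adding the itchy-eyes evidence roughly halves the posterior probability of a cold, matching the explaining-away argument above.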
If you observe one node, that suggests the values of other nodes, which then propagate further. So it can be rather complicated, and in general computing these quantities can actually be NP-hard. So it's actually a pretty sophisticated operation here. Okay. So finally, we've seen two examples of Bayesian networks: alarm and medical diagnosis. Let me introduce the general case. In the general case, it's the same four-step procedure. You define a set of random variables; often I'll denote these X1 through Xn. You define any directed acyclic graph over the variables: any graph, but there can't be cycles. For each node Xi, you define a local conditional distribution of Xi given the parents of Xi, and then you define the joint distribution by just multiplying all the local conditional distributions together. Okay, some important reminders, because it's easy to make these mistakes. You have one local conditional distribution per node, not per edge. Also, the local conditional distribution depends on all the parents at once. So don't forget to write down the probability of C just because there are no edges into it. And don't write down P(H | C) times P(H | A); that's incorrect. You have to tie the parents together. In the literature they say the parents are "married"; maybe that's a helpful way to think about it. And then finally, know the difference between the local distributions and the joint distribution. All right. So when you do probabilistic inference, you're given this Bayesian network, which defines a joint distribution over all n variables. There's some evidence, usually written E = e, where big E is a

Segment 13 (60:00 - 65:00)

subset of the variables X. So, for example, you can say: I observe that X3 and X7 are 0 and 1, respectively. Then you're interested in a set of query variables, which could be any other subset of nodes, like X2 and X4, and anything not mentioned, like X1, X5, X6, is assumed to be marginalized out. And the output is a probability table which gives us the probability of every assignment to Q conditioned on E = e. Okay. So Bayesian networks actually show up in a bunch of places. Usually you don't think about them, but I'll give you an example: autoregressive language models. If we put on our Bayesian network hats, an autoregressive language model is actually a Bayesian network. The random variables are the tokens: suppose you're going to generate t tokens, so the tokens are x1 through xt. There are edges from all previous tokens to the current token. The local conditional distribution is given by the transformer, usually, in language modeling. And the joint distribution is still the same: the product of the distributions of each xt given all the previous tokens. So if you were to draw the network for a language model, it would look something like this: it just happens that every variable is connected to all previous variables, and so on. Typically, if you have a language model, you just sample forward. Given a prompt x1, x2, x3, you sample x4 given the previous tokens, then x5 given x1 through x4, x6 given the past, and so on. This is basically what you do in language modeling. But now the interesting thing is: what would probabilistic inference mean here? Probabilistic inference really means reversing things; the whole idea of probabilistic inference is that you can reason about any subset of variables given any other subset of variables.
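As a toy sketch of this view, here a hypothetical `next_token_probs` function stands in for the transformer's local conditional distribution (the vocabulary and probabilities are invented for illustration); forward sampling and the Bayesian-network factorization then look like this:

```python
import random

random.seed(0)

def next_token_probs(prefix):
    # Hypothetical stand-in for a transformer: returns the local
    # conditional distribution p(x_t | x_1, ..., x_{t-1})
    if len(prefix) >= 4:
        return {"a": 0.1, "b": 0.1, "</s>": 0.8}
    return {"a": 0.45, "b": 0.45, "</s>": 0.1}

def sample_forward(prompt, max_len=8):
    # Ancestral (forward) sampling: draw each token given all previous ones
    tokens = list(prompt)
    while len(tokens) < max_len and (not tokens or tokens[-1] != "</s>"):
        probs = next_token_probs(tokens)
        r, acc = random.random(), 0.0
        for tok, p in probs.items():
            acc += p
            if r < acc:
                tokens.append(tok)
                break
    return tokens

def joint_prob(tokens):
    # Bayesian-network factorization: product of the local conditionals
    prob = 1.0
    for t in range(len(tokens)):
        prob *= next_token_probs(tokens[:t])[tokens[t]]
    return prob

seq = sample_forward(["a", "b"])
print(seq, joint_prob(seq))
```

Forward sampling is cheap; "reversed" queries like inferring the prompt from the response require full probabilistic inference over the same joint distribution.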
So one thing you might want to do is: suppose I see a response x4, x5, x6; I want to figure out what prompt generated that response. Okay, so this has an application in jailbreaking language models. For example, you could take a given response, "Sure, here's how you make a bomb," or something dangerous, and probabilistic inference is figuring out the prompt that leads to that response. Okay, so this is very uncommon, mostly because language models are designed to just go forward. But it's interesting to note that once you define a joint distribution, you can mathematically go in any order you want, and going in different orders actually gives you some interesting insights. The downside is that this is an extremely computationally expensive task, and in general it's exponentially hard, but this paper has some interesting ideas on how to make it faster using variational inference. Okay, I just wanted to drop that example there so you can see how this topic might connect to some more modern ideas. Okay, in the last 10 minutes I'm going to talk about two things. One is a different way to think about Bayesian networks, and that's going to allow us to do something called rejection sampling for approximate Bayesian inference. Okay, so far we've defined these Bayesian networks by just writing down the local conditional distributions, right? And that's fine. That's better than writing down the whole probability table, which no one really does for a large number of variables. But now we're going to define them via probabilistic programs. There's a whole field called probabilistic programming which explores the interaction between

Segment 14 (65:00 - 70:00)

programs and distributions. So we'll give you a taste of this, and the ideas are fairly natural and elegant. So let's revisit our old alarm example. We have the burglary, earthquake, and alarm, and I'm going to define a function, a program called alarm, that's just going to give me a random sample from the joint distribution. Here's how this is going to work. I'm going to set B to Bernoulli(0.05), and all Bernoulli does is return 1 with that probability. This is backed by numpy's random number generator. Then I'm going to set earthquake the same way. So this is a random function. And I'm going to set A = B or E, and I'm going to just return the assignment. Okay. If I sample again, I might not get anything other than all zeros, because remember the probability is 0.05, so it's pretty low. But okay, so that's it, right? This is actually a formal definition of a Bayesian network. It's just unusual because it's just a program, but it's one way you can define one, and the thing I like about this is that it's really transparent what's going on: three lines of code. Okay. So now let's do probabilistic inference. How would you do probabilistic inference using this probabilistic program, which represents a Bayesian network and therefore a joint distribution? Suppose you wanted to compute the probability of B given A = 1. Because we've written the joint distribution as a program, this gives us a natural way to do probabilistic inference, called rejection sampling. And the key idea here is that I'm just going to draw a ton of samples just like this, select the samples that match the evidence, and record a histogram over the different query values of the samples I see. Okay. So here I'm going to generalize the notions of evidence and query. We've been thinking about evidence as a particular assignment to a variable, and a query as a particular set of variables.
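The program and the rejection-sampling idea just described can be sketched in plain Python (a minimal version using the standard-library random module in place of numpy; the dictionary sample format is my own choice here):

```python
import random
from collections import Counter

def bernoulli(p):
    # Return 1 with probability p, else 0
    return 1 if random.random() < p else 0

def alarm():
    # One forward sample from the joint distribution P(B, E, A):
    # three lines define the whole Bayesian network
    b = bernoulli(0.05)   # burglary
    e = bernoulli(0.05)   # earthquake
    a = b or e            # alarm is a deterministic function of its parents
    return {"b": b, "e": e, "a": a}

def rejection_sample(sampler, query, evidence, num_samples):
    # Draw samples, keep those matching the evidence, histogram the query
    counts = Counter()
    for _ in range(num_samples):
        sample = sampler()
        if evidence(sample):
            counts[query(sample)] += 1
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}

random.seed(0)
# Estimate P(B | A = 1); the exact answer is about 0.51
est = rejection_sample(alarm,
                       query=lambda s: s["b"],
                       evidence=lambda s: s["a"] == 1,
                       num_samples=100000)
print(est.get(1, 0.0))
```

Note that of 100,000 draws only roughly 10% are accepted (P(A=1) ≈ 0.0975), which previews why rejection sampling struggles when the evidence is rare.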
But they can be any functions, really. The query I'm going to think of as taking a sample and returning some value. Remember, a sample looks like this: it's a joint assignment to all the variables, and I'm going to return some value from it. In this case, I'm going to pluck out B, because I'm interested in the probability of B. Okay. And the evidence is going to be a function that takes a sample and returns whether it matches the evidence. So I'm going to peek inside to look at the value of A, and if it equals 1, then I return true. So the evidence is a function that tests whether a sample satisfies a particular criterion. Okay. So now rejection sampling is literally five lines of code, for some definition of five. I'm going to keep track of counts here. I iterate num_samples times, draw a sample, and test whether the evidence is satisfied on that sample; if it is, then I increment the counts. In this run I don't think I'll actually get a sample, but if I did, it would increment the count. Oh, I get one, I guess. Okay, and then I have a count of the number of particular query values I've seen, and I normalize that to get a probability and return it. Okay. So in that case it wasn't very interesting, because I only got one value. But if I do this a thousand times, we'll see that this gives me an estimate of the probability, the histogram of the query given the evidence. So remember, the query here was B, so the estimated probability of B given A = 1 is 0.56. Remember, the true number was 0.51 or so, so it's close, but it's not

Segment 15 (70:00 - 75:00)

exactly right, because this is just a sampling algorithm. So this algorithm can be very inefficient, especially when the evidence is rare, right? Because you're essentially sampling from the prior, and if you have a rare event, say one that happens with probability 0.001, then most of the time you're just going to waste your time sampling. So it's not a very efficient algorithm, but mathematically I think it's very clean. There is a guarantee that if you are very patient and you increase the number of samples to infinity, then this actually converges to the true probability. And in this class we've seen sampling many times: in policy evaluation we sample, in expectimax we can sample, and so on. Okay, let's do the medical diagnosis example. In medical diagnosis, remember, we're interested in the probability of cold given cough = 1. So here is the four-line network: cold, allergies, cough, and itchy eyes. Again, very simple. The query is C, the evidence is testing whether H equals 1 or not, and when I do rejection sampling, I get that the probability of cold = 1 given H = 1 is 0.26, which is actually pretty close to the right answer. And you see that this is closer even with 200 samples, because the probability of these events is higher. Here's another, fancier example, and it shows you the power, or the succinctness, of probabilistic programming. There are these things called hidden Markov models. They kind of look like this, and they can be used for object tracking.
So imagine there is some object, and h1, h2, and so on represent the position of that object at successive times. You don't see the object's position; instead, you get a noisy sensor reading of it. And the goal is, for example, if you observe E5, to figure out the probability of H3. Okay, so this is fairly complicated; there are a lot of things happening here. But rejection sampling is a fairly black-box way of handling it. So, hidden Markov model: here's what we're going to do. We're going to iterate for five steps, and at each step I look at the previous position and add either one with probability 0.5 or zero with probability 0.5. So with probability 0.5 I stay in the same state, and with probability 0.5 I advance to the next state. And then E is basically the noisy sensor observation: I take my true location and either add one or add zero. And then you return the sequence of H's and E's. Okay, so this gives you a sample. The query here is H with index 2, because it's zero-based instead of one-based, and the evidence is E with index 4, testing whether it equals 2. Okay. So now if I do rejection sampling, I can estimate the probability of H3 given E5 = 2, and basically I infer that the object most likely was in position one, but maybe it was in position zero. Okay. All right. So to summarize: this section was brief, but I think it has a lot of important insights. Probabilistic programming is another way to think about things: programs as representations of distributions. This is a pretty different way of thinking about probability, which I think has some attractive features, because once you can sample, you can do a lot of things, for example rejection sampling.
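A self-contained sketch of this hidden Markov model plus the rejection-sampling loop (I assume the hidden position starts at 0; the 0.5 transition and sensor-noise probabilities are as described above):

```python
import random
from collections import Counter

def bernoulli(p):
    return 1 if random.random() < p else 0

def hmm():
    # Hidden positions h_t: start at 0 (an assumption in this sketch),
    # stay or advance by one at each step; observations e_t add 0/1 noise
    h, hs, es = 0, [], []
    for _ in range(5):
        h += bernoulli(0.5)            # transition: stay or move right
        hs.append(h)
        es.append(h + bernoulli(0.5))  # noisy sensor reading
    return hs, es

random.seed(0)
counts = Counter()
for _ in range(200000):
    hs, es = hmm()
    if es[4] == 2:           # evidence: fifth sensor reading equals 2
        counts[hs[2]] += 1   # query: hidden position at the third step
total = sum(counts.values())
posterior = {pos: c / total for pos, c in sorted(counts.items())}
print(posterior)             # position 1 is most likely
```

For these parameters, the exact posterior works out to 0.2, 0.6, 0.2 over positions 0, 1, 2, matching the lecture's conclusion that position one is most likely.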
You can just draw samples and then compute, or at least estimate, the probability of various queries you're interested in. Okay, so maybe a quick discussion here. One point is that if you're

Segment 16 (75:00 - 77:00)

used to thinking about building classifiers, which we're all used to in modern machine learning, using Bayesian networks requires a shift in mindset. So, for example, suppose you wanted to predict whether you have a cold given a cough. In traditional machine learning, you would say: okay, let me train a classifier that maps my input to an output. The input here is whether you have a cough, and the output is cold. But in Bayesian networks, you're often going from output to input, right? Because the way to think about it is: I am interested in the cold, and I'm defining a Bayesian network that takes the cold and generates the things I observe, and then I'm using probabilistic inference to go backwards. So why would you do this? One advantage is that it handles missing information. For example, if your data points have a missing value, say the doctor didn't record whether you had itchy eyes, Bayesian models allow you to automatically take that into account. If you have prior knowledge about how things work, like your alarm system and what it reacts to, you can just write down what it does instead of trying to learn it. You can interpret all the intermediate variables; it gives you interpretability, which is often missing from big end-to-end systems. And then you can answer questions about causality, which is beyond the scope of this class, but Bayesian networks give you a way to do that. Okay. So the final summary. In probability, there are joint distributions; you can marginalize them and you can condition on them, and these are all tensor operations. Bayesian networks give you a convenient way to define joint distributions. Probabilistic inference says: given a joint distribution, I can answer questions that look like the probability of a query given evidence. And then finally, I showed you probabilistic programs as a way to represent joint distributions as programs.
And finally, rejection sampling, which is a very simple and generic way to do probabilistic inference in any Bayesian network, but it is very slow. So next time we're going to try to improve the algorithm and look at other alternatives.
