# Stanford CS221 | Autumn 2025 | Lecture 13: Bayesian Networks and Gibbs Sampling

## Metadata

- **Channel:** Stanford Online
- **YouTube:** https://www.youtube.com/watch?v=Dk7Kqqehzjk

## Contents

### [0:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk) Segment 1 (00:00 - 05:00)

Last lecture we started talking about Bayesian networks. Bayesian networks are a new chapter on how to represent the world, and in particular uncertainty about the world. Last time we looked at rejection sampling as a way to answer queries about Bayesian networks; this time we're mostly going to focus on Gibbs sampling, which is a potentially faster way, I should say, to do probabilistic inference, and then we're going to talk about conditional independence. So let's start by reviewing Bayesian networks. Bayesian networks are probably one of the most intensely deep parts of the class, so please ask me questions if anything is confusing. To get us into the right mindset, we should think about a joint distribution over a set of variables as like a database that specifies how the world works. You have a state of the world that's represented by an assignment to a bunch of variables, and the database basically says, for every assignment, how likely it is. Remember, we define Bayesian networks through a four-step procedure. The first step, in any given problem, is to figure out what the variables are. In general we talk about variables X1 through Xn, but in the specific example we saw a case with three variables: one representing whether there's a burglary, B; one representing whether there's an earthquake, E; and one representing whether the alarm went off, A. Okay, so once you have all the variables, you define a graph over them. In this case, we intuitively think of burglaries and earthquakes as independent events, but the alarm is going to depend on whether there's a burglary or an earthquake. That's what's advertised by the alarm: the alarm goes off if there's a burglary or an earthquake. Okay, so now we have this graph, and it specifies the structure of the Bayesian network. Think of it as the bones, in some sense. And now the third step is to actually put meat on those bones.
The meat here is actually specifying, numerically, the local conditional distribution for every node given the parents of that node. In this case there are three nodes. For node B, I have the probability of B given its parents; there are no parents, so it's just p(B). I have the probability of E given its parents; there are no parents, so that's p(E). And the probability of A given its parents, which are B and E, so I have p(A | B, E). The local conditional distributions are going to be denoted with a lowercase p. Okay, so just to ground this out in code, we can look at probability tables. For example, p(B) represents a table which, for every setting of B, gives a probability. The probability of there being a burglary is 0.05, and the probability of there not being a burglary is 0.95. Same with earthquake: the probability of an earthquake is 0.05, and of no earthquake 0.95. And finally I have the table which represents the probability of the alarm given the status of burglary and earthquake; for all possible settings of B and E, I have to give a probability of A. And this is really shorthand for saying: I first compute B or E, that's a boolean, and if that equals A, the probability is one; otherwise, the probability is zero. So what this logically says is A = B or E. You can check that if A equals B or E, I get a probability of one, and if A does not equal B or E, I get a probability of zero. So that's a probabilistically verbose way of just saying A = B or E. Okay, so given these tables, I can define the joint distribution over the set of random variables as simply the product of all the local conditional distributions. So in this particular example, the joint distribution, written

### [5:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=300s) Segment 2 (05:00 - 10:00)

with an uppercase P, is equal to just the product of the local conditional distributions: p(B) times p(E) times p(A | B, E). We also saw that we can use einsum notation to simply multiply all these local conditional distributions, which are just tensors, together. So you have p(B), p(E), and p(A | B, E), and you just make sure you index the right variables: B and E for the first two, and then B, E, A for the last local conditional distribution. And that gives me the joint distribution. So I could have just written down numbers for all eight possible assignments, but the way we did it is by going through this more qualitatively inspired process of defining the Bayesian network, and this gives us a little more interpretability than just writing down some arbitrary numbers. So what do you do with a Bayesian network? Right: we infer the probability of certain variables given other ones. This is a joint distribution; you can think about it as a database. It can literally be represented as a table, but instead of using SQL, you use the language of probabilistic inference. For example, you might be interested in: what is the probability of there being a burglary, given that the alarm went off? That is one particular query, and the way you compute it is as follows. I'm going to introduce another step here; this is different from last time, and I think it's a bit more intuitive. Up here I have just the probability of B, E, A, and in general a probabilistic inference query looks like: I'm interested in the probability of some query variables given some evidence. The evidence here is A = 1. So what I'm going to do is condition on the evidence: I take the probability tensor over B, E, A, and I'm going to slice it. This axis corresponds to B, this one to E, and this one to A, right?
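The einsum construction mentioned above can be sketched like this; a minimal version assuming 0/1-valued variables and the numbers from the lecture's tables (the array names are mine, not the lecture's):

```python
import numpy as np

# Local conditional distributions as tensors, indexed by values 0/1.
p_B = np.array([0.95, 0.05])            # p_B[b] = p(B = b)
p_E = np.array([0.95, 0.05])            # p_E[e] = p(E = e)
p_A = np.zeros((2, 2, 2))               # p_A[b, e, a] = p(A = a | B = b, E = e)
for b in range(2):
    for e in range(2):
        p_A[b, e, int(b or e)] = 1.0    # the alarm is deterministically B or E

# Joint distribution P(B, E, A): multiply the local conditionals, indexing
# b,e for the first two tensors and b,e,a for the last one.
P = np.einsum('b,e,bea->bea', p_B, p_E, p_A)
print(P.sum())   # a valid joint distribution sums to 1
print(P[1, 0, 1])  # P(burglary, no earthquake, alarm) = 0.05 * 0.95 * 1
```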
So when I say A = 1, I'm just going to take the slice where A = 1. What this looks like is: I take this table, look at all the rows where A = 1, and select them out. These four rows are exactly the rows where A = 1; you can check that. Okay, so now I'm interested in B given A = 1. I don't care about E, so I'm going to marginalize E out. Remember, marginalization is just kind of ignoring E: I collapse entries that have different values of E but are the same elsewhere. So I collapse these first two rows and these last two rows. And in this notation I can use einsum to do that as well. That gives me the probability of B and A = 1. The next step: this is the probability of B and A = 1, and I want to condition on A = 1. To condition, by the rules of probability, I first need to compute the probability of the evidence, P(A = 1), and that conveniently here is just the sum of these two entries. So if I sum these two entries, I get the probability of A = 1. And finally, the joint divided by the marginal gives me the conditional, so I just divide each of these two numbers by this number, and I get my answer. So finally, the probability of a burglary given alarm = 1 is 0.51, which is what we got last time. Now, one downside is that forming the joint distribution in this way can be really slow. Imagine you have 100 variables, each with two values.
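The slice-marginalize-normalize pipeline just described can be sketched as follows; this is my reconstruction using numpy, not the lecture's actual code:

```python
import numpy as np

# Rebuild the joint P(B, E, A) from the local conditional tables.
p_B = np.array([0.95, 0.05])
p_E = np.array([0.95, 0.05])
p_A = np.zeros((2, 2, 2))
for b in range(2):
    for e in range(2):
        p_A[b, e, int(b or e)] = 1.0
P = np.einsum('b,e,bea->bea', p_B, p_E, p_A)

# 1. Condition on the evidence A = 1 by slicing the tensor.
P_slice = P[:, :, 1]            # P(B, E, A=1), shape (2, 2)
# 2. Marginalize E out (collapse rows differing only in E).
P_B_A1 = P_slice.sum(axis=1)    # P(B, A=1)
# 3. Normalize by the probability of the evidence P(A=1).
answer = P_B_A1 / P_B_A1.sum()
print(answer[1])  # P(B=1 | A=1) ≈ 0.51, as computed in the lecture
```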

### [10:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=600s) Segment 3 (10:00 - 15:00)

Then the total number of assignments is 2^100, and you definitely don't want to be writing that down. Okay, so what we're going to do is approximate inference: we're going to try to approximate the answer and hope for the best. That's all we can do. And we took a little detour to explain probabilistic programs, which are another way to think about Bayesian networks, in order to introduce rejection sampling. The idea behind a probabilistic program is that it's a program that returns a sample from the joint distribution, and therefore it defines the distribution. I give you a program, and if samples from it obey some distribution, then that's as good as a definition, in some sense. Okay, so here's an example of the alarm probabilistic program. I basically go through each of the variables in the Bayesian network and set its value given its parents; naturally, a node has to be generated after its parents are generated. So with probability 0.05 I set B = 1, so usually I get zero; for E I get zero; and A = B or E, so I get zero. That is one particular assignment. So when I run a probabilistic program, I get an assignment of variables to values. If I run it again, I'm going to get a potentially different assignment, independent of the first. Then rejection sampling, as we talked about last time, is: I first declare what part of the assignment I'm interested in, the query variable. Here I'm interested in B, so given a sample I want to extract B. And then, what is the evidence? I'm going to represent the evidence as a test of whether the sample satisfies that evidence; here I look up the value of A and check that it equals one. Remember, I'm still computing the probability of B given A = 1. Okay, so let's step through the rejection sampling algorithm, just taking one sample.
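For reference, the whole procedure might look like this in code; the function names and structure here are my own sketch, not the lecture's actual implementation:

```python
import random

def sample_alarm():
    """Probabilistic program: one sample from the alarm network's joint distribution."""
    b = 1 if random.random() < 0.05 else 0   # burglary
    e = 1 if random.random() < 0.05 else 0   # earthquake
    a = b or e                               # alarm is deterministically B or E
    return {'B': b, 'E': e, 'A': a}

def rejection_sampling(num_samples, query, evidence):
    """Estimate P(query | evidence), discarding samples that fail the evidence test."""
    counts = {}
    for _ in range(num_samples):
        x = sample_alarm()
        if evidence(x):                      # keep only samples consistent with the evidence
            v = query(x)
            counts[v] = counts.get(v, 0) + 1
    total = sum(counts.values())             # smaller than num_samples due to rejection
    return {v: c / total for v, c in counts.items()}

random.seed(0)
est = rejection_sampling(100000,
                         query=lambda x: x['B'],
                         evidence=lambda x: x['A'] == 1)
print(est)  # estimate of P(B | A=1); the exact answer is about 0.51 for B=1
```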
So for each sample, I draw a sample by calling the probabilistic program, which might give me something like this. Oh, I got an earthquake; very lucky, or not lucky, but anyway. If evidence(sample) is true, that is, the test A = 1 passes, then I increment the total count, and I also increment the count of query(sample), which gives me the value of B, which is zero here. I loop until the end, look at the total count, and normalize by that count to get the probabilities. So here's a question: why is the total count not the same as the number of samples? Right: only some samples meet the evidence, because of rejection; I only count a sample if it meets the evidence. Okay. So that gives me the probabilities, and if I do it 300 times, I get this distribution: 0.44 probability of burglary given alarm = 1. Notice that it's in the ballpark: this is the exact quantity we computed using the actual tensor, and here is the approximation with 300 samples. Now, if you increase the number of samples to a million, this is going to get really close, but at 300 we're still only in the rough ballpark. Okay. So one problem with rejection sampling is that it can be quite slow if the evidence is rare. In this case I actually got extremely lucky; remember, last lecture I wasn't able to get any earthquakes or burglaries. In general, if you're conditioning on some evidence, its probability could be exponentially small; maybe the probability of hitting the evidence is 10^-30 or something, and then you would just be sitting around rejecting, rejecting, rejecting. So it can be a very slow algorithm. And the problem with this is

### [15:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=900s) Segment 4 (15:00 - 20:00)

that if you think about the program, I'm just generating samples without even looking at the evidence, and I just hope the program happened to generate the evidence. I really like rejection sampling because it is very simple; in some sense this is the fastest path to defining a Bayesian network along with an inference algorithm that provably works in the limit, but it is not the most efficient way to do it. Okay, so how can we do this faster? That's going to be the topic of the next part of this lecture. Okay, so now I'm going to talk about Gibbs sampling. Remember what rejection sampling looks like: I draw a sample, and I might reject it, reject it. The nice thing about rejection sampling is that every sample is independent; if I'm estimating some quantity, 10 samples are 10 independent samples. But the con is that every time I'm starting from scratch: the flip side is that each sample is generated fresh, without even considering what happened before, so I'm not really learning anything from my past experience. And the second thing, as I mentioned before, is that I'm generating a sample without looking at the evidence, so if the evidence is rare, that's going to cause a lot of rejections. Gibbs sampling is another sampling procedure, but here the idea is that I'm going to work with a previous sample, which I'm going to construct to always satisfy the evidence, and then gradually modify it over time. It's a much more incremental approach, rather than throwing everything out, sampling afresh, and hoping that you hit the evidence. Okay, so here's the basic idea of Gibbs sampling.
I'm going to start with an arbitrary assignment, and then I'm going to iteratively change one variable at a time given the other variables. And I'm going to do this in a very specific way, so that I get a bunch of samples, and those samples are going to give me the correct answer in the limit. It is worth noting that Gibbs sampling is a special case of a wider class of algorithms called Markov chain Monte Carlo; if you study probability or statistics, you might encounter these. The idea is that you have a sequence of samples, each one dependent on the previous one. That's why it's called a Markov chain: each sample depends on the previous sample. The idea with Gibbs sampling is that we're going to start with samples that satisfy the evidence. From the get-go, I incorporate the evidence, so I don't have to reject anything; there's no rejection in Gibbs sampling. Now, the con is that the samples are not going to be independent: this sample might be very similar to that one. So if you could get 10 independent samples from rejection sampling, that's going to give you a better estimate than 10 samples from Gibbs sampling, where the samples are correlated. Okay, I still haven't told you what Gibbs sampling is, so let me try to go through it using a simple example. How many of you have played the game telephone? Okay, maybe when you were really young. The idea is that some person A sends a message to person B, then person B sends a message to C, and in general this could go on for a while, and then you see if the message you got at the end is the same as the one at the beginning. Of course, the whole point is that at every step there's a chance the message gets corrupted, so at the end the two are not the same and everyone bursts out laughing.
Okay, that's how it works. So let's model this using a Bayesian network. I'm going to assume the message is a single bit, either one or zero, and the probability of the first message is given by a coin flip: 0.5 probability of zero, 0.5 probability of one. Okay. So then I look at the probability of B given A, and that's going to be essentially

### [20:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=1200s) Segment 5 (20:00 - 25:00)

a 0.8 probability of saying the same thing and a 0.2 probability of flipping it. So if A is zero, the probability of B = 0 is 0.8; if A is one, the probability of B = 1 is 0.8. And C given B is the same: a 0.8 probability of preserving the message and a 0.2 probability of flipping it. Okay, so this defines another Bayesian network, and in particular it defines a joint distribution over A, B, and C. Now the probabilistic inference query I want to ask is: assuming C, the person at the end, gets a one, what is the probability that A equals one or zero? Intuitively, if you saw C = 1, that probably increases the probability of A = 1, because the probability of an error is less than 50%. But there's noise, so it's not going to be certainly one, because there could have been flips. Okay, that's the intuition. You shouldn't necessarily be expected to know the exact value in general, so let's try to compute it. So let's first rerun rejection sampling. Here is the probabilistic program: for A I sample a one; for B I decide to preserve the message; for C I decide to preserve the message; and I get this sample. Then I do rejection sampling: I'm interested in the query, which is what A is; the evidence I condition on is C = 1; and with 100 samples I get the estimated probability of A = 1 to be 0.65. Okay, that seems plausible: at every step there's a 0.2 probability of error, so if you account for that, you get around 0.6. That's sketchy math; a more precise estimate you can compute. All right, so let's do Gibbs sampling. Here's the pseudocode: we start with a particular initialization of all the variables, and then I round-robin through the variables and change each one based on all the other variables.
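The rejection-sampling baseline for the telephone network just described might look like this in code; a sketch with names of my own choosing (the exact answer works out to 0.68, which a large sample count approaches):

```python
import random

def sample_telephone():
    """Probabilistic program for the telephone chain A -> B -> C."""
    a = random.randint(0, 1)                   # fair coin flip for the first message
    b = a if random.random() < 0.8 else 1 - a  # preserved w.p. 0.8, flipped w.p. 0.2
    c = b if random.random() < 0.8 else 1 - b
    return {'A': a, 'B': b, 'C': c}

def rejection_sampling(num_samples):
    """Estimate P(A=1 | C=1) by rejecting samples with C != 1."""
    counts = {0: 0, 1: 0}
    for _ in range(num_samples):
        x = sample_telephone()
        if x['C'] == 1:                        # evidence test
            counts[x['A']] += 1
    return counts[1] / (counts[0] + counts[1])

random.seed(0)
est = rejection_sampling(100000)
print(est)  # estimate of P(A=1 | C=1); the exact value is 0.68
```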
In particular, to change a variable, I'm going to compute the conditional distribution over that variable given all the other variables, and then sample a new value for that variable from this conditional distribution. Okay, so let me walk through an example of what that looks like. Here I've initialized with an arbitrary sample that satisfies the evidence, so C = 1: A is one, B is zero, C is one. The initialization generally doesn't matter too much, but for certain problems it could make your convergence faster or slower. Okay, so now let's look at what this conditional distribution is. Suppose our goal is to sample B given A and C. A is one, C is one, and B is zero right now. What I'm asking is: what is the probability of B given A and C? What's my best guess at what B should have been? To answer that, we do some basic math. The probability of B equals some value, given A and C, is the joint probability of A, B, and C divided by the probability of A and C, that is, of all the variables except the one I'm sampling. Okay, so let's break this down. The normalizer, the thing I'm dividing by, is the probability of all the variables we're not sampling, and that is simply equal to the sum, over values of B, of the joint. And the joint probability is a product of local conditional probabilities, as defined above. And note something important here: I'm writing this down, but I don't have to instantiate the

### [25:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=1500s) Segment 6 (25:00 - 30:00)

joint distribution for all possible values of A, B, and C, because in Gibbs sampling I'm only looking at a particular assignment, and I only care about looking up the probability of that particular assignment, or the assignments around it. Okay, so this is where I get the savings; if I had to instantiate this for all values of A, B, and C, I might as well just brute-force the whole thing. We do still have to, as stated, touch every variable when evaluating the assignment; we'll come back to that later. Okay, so suppose we're sampling B. I need to compute the probability of A = 1, B = 0, C = 1, as well as the probability of A = 1, B = 1, C = 1. Basically, I'm freezing all the other variables, only changing B, and evaluating each of these new assignments by its joint probability. Okay, so let's look at how this is defined. The joint probability is a function: it takes a particular assignment and then, in some sense, makes a modification. It says: I'm going to change some variable var to a value val. So I might pass B for var and one for val. Then I let y be the updated assignment: basically taking the original assignment and substituting var: val into it. So here I change (1, 0, 1) to, well, with B = 0 nothing actually gets changed there, and then I compute the joint probability of the updated assignment, which is that tensor: the probability of A, using y's value of A, which is one here; times the probability of B given A, using these assignments, so zero given one; times the probability of C given B. Okay, so that gives me a probability of 0.02 for A = 1, B = 0, C = 1. And now I do the same thing for the alternative value I'm considering, which is B = 1. Again, y is now the updated assignment (1, 1, 1); I compute its probability and I get a number out. Okay.
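A sketch of that computation for the telephone network; the function and variable names here are my own, not the lecture's code:

```python
# Telephone-network local conditionals: p(A), p(B|A), p(C|B).
def p_first(a):
    return 0.5                            # fair coin for the first message

def p_step(prev, cur):
    return 0.8 if cur == prev else 0.2    # message preserved w.p. 0.8, flipped w.p. 0.2

def joint_probability(x, var, val):
    """Joint probability of assignment x with variable `var` substituted to `val`."""
    y = dict(x)
    y[var] = val                          # the updated assignment y
    return p_first(y['A']) * p_step(y['A'], y['B']) * p_step(y['B'], y['C'])

x = {'A': 1, 'B': 0, 'C': 1}              # current assignment; evidence C = 1
w0 = joint_probability(x, 'B', 0)         # P(A=1, B=0, C=1) = 0.5 * 0.2 * 0.2 = 0.02
w1 = joint_probability(x, 'B', 1)         # P(A=1, B=1, C=1) = 0.5 * 0.8 * 0.8 = 0.32
# Normalize by P(A=1, C=1) = w0 + w1 = 0.34 to get the conditional over B.
print(w0 / (w0 + w1), w1 / (w0 + w1))     # ≈ 0.06 for B=0, ≈ 0.94 for B=1
```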
And now I compute the normalization factor, which is the sum of these two things, 0.34, and then I divide each of these quantities by P(A, C), and that gives me a valid distribution over B. So now, given this distribution, 0.06 for B = 0 and 0.94 for B = 1, I can sample B, and notice that I've modified x to be (1, 1, 1) based on that sample. Intuitively, what's happening is that (1, 1, 1) is much more likely than (1, 0, 1), so B gets pulled toward one because of A = 1 and C = 1. Now, this is probabilistic: with about 6% probability I could have just left it at (1, 0, 1). But it's more likely that A started with the message one and the message was preserved along the way, rather than A flipping to zero and then flipping back to one. I think the most important thing is this: you have an assignment, you consider one particular variable, you look at the different values that variable could take, you evaluate the joint probability of each of those possibilities, and then you sample according to those probabilities. Okay, so here's Gibbs sampling in general. I'm going to define the Gibbs sampling function. I pass in the initial assignment; vars gives me the variables I'm going to

### [30:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=1800s) Segment 7 (30:00 - 35:00)

actually sample. The query is the same as in rejection sampling, where I'm interested in, let's say, A. And joint_probability is how I've decided to score assignments, and I'm going to do it for some number of iterations. Okay, so let's do Gibbs sampling. I start out with x, and then I'm going to track counts of the query values, as before in rejection sampling. I'm going to iterate some number of iterations and, within each, go through each variable. Here the variables I'm sampling are A and B. Why don't I include C? Right, C is the evidence; it's what I'm conditioning on. I've conditioned on C = 1, so I don't want to change it. Okay, so let's consider A; that means I'm going to try to modify A. What I'm doing here is considering all possible values that A could take on, and for each of them I compute the joint probability of the assignment where the variable has been set to that value. That gives me, for A, joint probabilities of 0.08 and 0.32. Now I compute the probability of the evidence, sorry, of B and C, which is the sum of these two things, and I normalize, so here I'm just normalizing the probabilities to 0.2 and 0.8, and then I sample according to that. In this case I drew a one, so nothing changed. Then I record the count of the query value: I look at this assignment, I see A = 1, so I write down that I saw one once. Okay. And now I go to the second variable, B, and again I compute the joint probabilities here; this is actually what we just did in the specific example. I normalize the probabilities, I sample, and here again I didn't change anything, and then I record that I saw A = 1 another time. And I keep doing that for some number of iterations, and at the end I compute the probabilities from the counts.
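This loop can be sketched compactly for the telephone network; as before, the helper names are mine and this is a reconstruction, not the lecture's exact code:

```python
import random

def p_first(a):
    return 0.5                                  # fair coin for A

def p_step(prev, cur):
    return 0.8 if cur == prev else 0.2          # preserve w.p. 0.8, flip w.p. 0.2

def joint_probability(x, var, val):
    y = dict(x)
    y[var] = val
    return p_first(y['A']) * p_step(y['A'], y['B']) * p_step(y['B'], y['C'])

def gibbs_sampling(x, variables, query, num_iters):
    """Estimate P(query | evidence); evidence variables are left out of `variables`."""
    counts = {}
    for _ in range(num_iters):
        for var in variables:                   # round-robin over non-evidence variables
            # Conditional distribution of `var` given everything else, up to a constant.
            weights = [joint_probability(x, var, v) for v in (0, 1)]
            total = weights[0] + weights[1]     # normalizer: sum over var's domain only
            x[var] = 0 if random.random() < weights[0] / total else 1
            v = x[query]                        # record the query value after each update
            counts[v] = counts.get(v, 0) + 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

random.seed(0)
x = {'A': 1, 'B': 0, 'C': 1}                    # initialization satisfies the evidence C = 1
est = gibbs_sampling(x, ['A', 'B'], query='A', num_iters=50000)
print(est)  # estimate of P(A | C=1); the exact value of P(A=1 | C=1) is 0.68
```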
So that's one round of Gibbs sampling. Now, if I run it for multiple iterations, what do I get? I'm interested in the query, which is the probability of A given C = 1, and I get that the estimated probability of A = 1 given C = 1 is 0.68. Okay. And if I compare that with rejection sampling, it's in the same ballpark. These are randomized, different algorithms, so we're not going to get exactly the same thing. Okay, so let me actually draw a picture; maybe it'll help visualize what's happening here. I have variables A, B, and C. C = 1, so I'm not going to touch it. This axis is time, over some number of iterations. So I start out with (1, 0, 1). Then I'm going to sample A according to some probability; let's say I get (1, 0, 1) again. And now I'm going to sample B. Let's

### [35:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=2100s) Segment 8 (35:00 - 40:00)

say I sample a one, because remember, that's more likely. Then I go back to A and sample; let's say I got unlucky and it becomes a zero. I sample B; let's say it's a one. And now I sample A; maybe that becomes, I don't know, a one. And then I sample B now. Sorry, I'm getting confused here. Okay: A, B, A; that shouldn't be there. And this continues, dot dot dot. So then, how do I compute the estimate of the probability? Each of these columns is a sample, a joint sample. What I'm going to do now, if I'm interested in, let's say, the probability of A = 1 given C = 1: well, C = 1 is already taken care of, because I never change it. So I'm going to look only at the query values, count how many ones there are, and the fraction of ones is my estimate of the probability of A given C = 1. Sorry, that must have been a mistake: here I'm sampling B, so A should not change. Yep, you're right: in each column I'm only changing one variable at a time. Okay. And the reason you only change one variable at a time is that it keeps things tractable: when I'm fixing everything else and only looking at one variable, I can consider all the possible values of that variable; in some sense I'm brute-forcing that variable, seeing what its value could be, and then sampling according to that distribution. Yeah. So in the code, for every variable I'm considering all possible values, and that's the same here; I'm just not showing the computation of the probabilities. I pick this variable, I consider a zero here, I also consider a one here, and I sample, and I happen to choose a one. All right.
So the running time of this algorithm is the number of iterations, times the number of variables, because I go through all the variables per iteration, times the size of the domain, which is the number of values a variable can take on, two in this case, times the number of variables again, because every time I consider a possible assignment, joint_probability has to go through all the variables to evaluate the joint probability. Okay. So, in summary, the basic form of the algorithm is: update one variable at a time using its conditional distribution given all the other variables, and compute that conditional distribution using the joint distribution, where I only have to sum over the domain of the variable we're sampling. Okay. So next, I'm going to show you how to make this a bit more efficient by leveraging the structure of the Bayesian network. Let's do that, and it's going to introduce a new concept called the Markov blanket. So imagine this is some arbitrary probability distribution over five variables A, B, C, D, E. If you were doing Gibbs sampling here, then for every variable, and for every value of that variable, you'd have to consider the whole set of variables in your entire

### [40:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=2400s) Segment 9 (40:00 - 45:00)

Bayesian network. So can you avoid doing that? Intuitively, it seems like you should be able to, because only one variable is changing; why should the other variables matter? Maybe they don't. It's not that you can completely ignore all the other variables, because the variables interact, but these interactions are potentially local. Okay. So remember this telephone Bayesian network: A generates B generates C, the same thing we had before. Now suppose we're sampling, computing the probability of A given B and C. What we would normally do is look at the joint distribution for both A = 0 and A = 1. I'm considering what A should be, so I consider both values, A = 0 and A = 1, and I fill in my current assignment for all the other variables; just by the definition of the Bayesian network, this is the probability of A, times B given A, times C given B. When I compute the probability, I also have to normalize: I sum these two numbers and get the probability of B = b and C = c, and then I divide each of these quantities by this normalizer and sample from that. Okay, so that's what we did normally. But there's something we might notice about this expression that could allow us to save some work. If you look at this expression, I care about what's happening to A; A is changing, either zero or one. But if you look at this last factor here, it doesn't depend on A. It's just some number. And in particular, this same number gets multiplied in for A = 0 and for A = 1, and in the denominator I also have the same number. So I'm basically multiplying and dividing by the same number. You can see that whether it's 0.1 or 0.2, it doesn't really matter. Therefore, I can just drop it completely and ignore it. Okay.
So in general, there's a question of which variables you can drop and which you can't, and there's a very nice way to think about that. We only need to include the factors that touch the variable we're sampling, that interact with it directly. Okay. So B is connected to A; therefore, I can't drop that factor. But C is not connected to A, so I can drop it. The set of variables that we need to consider is called the Markov blanket, and the Markov blanket of a variable in a Bayesian network is simply all its children and its parents. So let me draw this visually. In general, you have a variable, say the one you're sampling; it has a number of parents, and maybe these have other parents. Maybe this is really complicated; as long as it's a DAG, it's fine. So I'm just going to draw some stuff, and suppose I'm interested in this node. What is the Markov blanket of this node? It's all its children, so these nodes are in the Markov blanket, and all its parents. These are all the local conditional distributions that depend on this value. So this node is not in the Markov blanket, and this one isn't either. Okay. So when I compute the joint, you can imagine just writing down the joint; remember, there's a local conditional distribution for every variable. So I

### [45:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=2700s) Segment 10 (45:00 - 50:00)

have this variable conditioned on its parents, and then each child conditioned on its parents, which includes this node. Everything else doesn't depend on the orange node. So the question is: do you have to include co-parents? For example, just to make it more interesting, let's say this child also depends on that other node. The answer is yes, because if you label these X, Y, Z, this conditional distribution is P(Y | X, Z). The parents always act as a unit. So if you're changing X, you have to depend on the value of Z to determine what the probability of Y is. Okay. So for a simpler example, let's do this one. What's the Markov blanket of A? It should be B. Of B? A and C. And what about C? B. Okay, so I think we're good. So what this means is that instead of computing the joint probability over the entire assignment, we're only going to include the terms that involve the Markov blanket. For this particular three-variable network, it looks like the following. I have a joint assignment x, and I update variable v with a new value. If v is A, I just have to include P(A) and P(B | A); I don't have to look at P(C | B). If the variable is B, I only have to look at P(B | A) and P(C | B), and I don't look at P(A). So if I run Gibbs sampling, passing in this function to evaluate a score for every assignment, then I get this quantity, which is similar to what I had before, but a lot faster. Actually, this should compute exactly the same answer as Gibbs sampling with the full joint probability, up to floating point precision. Okay.
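The blanket computation can be sketched directly from the graph. Note that, per the co-parent discussion above, the variables you actually need include the children's other parents as well; for the lecture's chain A → B → C this makes no difference, and the quiz answers come out as stated.

```python
# A Bayesian network as a dict: node -> list of parents.
# This is the lecture's three-node chain A -> B -> C.
parents = {"A": [], "B": ["A"], "C": ["B"]}

def markov_blanket(node, parents):
    """Parents, children, and the children's other parents (co-parents):
    exactly the variables sharing a CPT with `node`."""
    blanket = set(parents[node])                          # its parents
    children = [v for v, ps in parents.items() if node in ps]
    blanket.update(children)                              # its children
    for child in children:                                # co-parents
        blanket.update(p for p in parents[child] if p != node)
    return blanket

print(markov_blanket("A", parents))   # {'B'}
print(markov_blanket("B", parents))   # {'A', 'C'}
print(markov_blanket("C", parents))   # {'B'}
```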
And there's a speed-up here because instead of considering the total number of variables when evaluating the probability of an assignment, I only have to consider the size of the Markov blanket. So this, I would say, is an optimization. It doesn't resolve any fundamental issues with Gibbs sampling, but it's a "why not do it" kind of thing. Okay. So the quick summary is: the Markov blanket of a node is all its children and parents, and Gibbs sampling only requires the local conditional distributions that touch the node and its Markov blanket. This is much more efficient if the Markov blankets are small. If you have a node that's connected to many nodes, it doesn't save you very much, but it doesn't hurt either. Okay, let's move on. So now let's apply Gibbs sampling to this alarm network, which we're familiar with. Remember we have burglary and earthquake giving rise to alarm, and here are the probability tables. We're going to initialize with an arbitrary sample that satisfies the evidence. So assume there's a burglary; assume everything bad happens. And then let's say we're interested in

### [50:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=3000s) Segment 11 (50:00 - 55:00)

the probability of burglary given the evidence, which is A = 1. So if I run Gibbs sampling here, I get P(B = 1 | A = 1) is about 0.6. If you remember, that's not exactly the correct answer; it was more like 0.51. If you run it again for 200 iterations, it's 0.33 now. So you might wonder: this is kind of not working super well. Let's try to understand why. As I mentioned before, these sampling algorithms are just approximate, and your mileage is going to vary depending on what situation you're in. So it's very important, before you try one of these algorithms, to be attentive to when it's going to work and when it's not. And by the way, I'm also going to talk about when rejection sampling works and when it doesn't. Okay, so let's actually start with rejection sampling. I've already said this, but just to put it in concrete terms: when is it hard for rejection sampling? The answer is when the evidence has low probability, because you're going to reject most of the samples. The fraction of samples that you actually keep is basically the probability of the evidence. So if the evidence has low probability, you're going to have very few samples to work with, possibly even zero. Okay. So here is an example: two variables, A and B, where B = 1 is very unlikely. If A is zero, B is zero with probability 0.999; if A is one, B is zero with probability 0.998. So if you look at the joint distribution, the probability of B = 1 is going to be quite small. But what I'm interested in is P(A | B = 1). Intuitively, if you think about P(A | B = 1), you're taking these two rows, and the answer should be 2/3, since the ratio between these two rows is one to two.
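The two-row calculation can be checked by direct enumeration. The 0.999 / 0.998 rows are from the lecture; the uniform prior on A is my assumption (it is what makes the answer come out to 2/3).

```python
# The rare-evidence network A -> B.  The 0.999 / 0.998 rows are from the
# lecture; the uniform prior on A is an assumption (the 2/3 answer implies it).
p_a = {0: 0.5, 1: 0.5}
p_b_given_a = {0: {0: 0.999, 1: 0.001},
               1: {0: 0.998, 1: 0.002}}

# Exact answer: restrict to the two rows consistent with B = 1 and normalize.
joint = {a: p_a[a] * p_b_given_a[a][1] for a in (0, 1)}
p_evidence = sum(joint.values())                  # P(B = 1)
posterior = {a: w / p_evidence for a, w in joint.items()}

print(p_evidence)    # ~0.0015: rejection sampling keeps ~1.5 in 1000 samples
print(posterior[1])  # ~2/3, as the two-row ratio 0.001 : 0.002 suggests
```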
But you need to reject all the samples that don't match the evidence. So if you were to do rejection sampling, you would reject the vast majority of samples; with these numbers you'd only keep about 1 or 2 out of every 1,000, and it would take on the order of a hundred thousand samples or more to get a decent estimate. So this is a very inefficient way of sampling. But if you do Gibbs sampling, this is totally fine, because what Gibbs sampling says is: I'm going to condition on B = 1. So immediately I restrict to only these two rows and normalize, and in fact Gibbs sampling will just give you the exact answer every time. Okay. So what about the other way? What examples are hard for Gibbs sampling? The answer is when the variables are highly correlated. Let's take another example: two variables A and B, where A puts probability 0.5 on zero and one, and B = A, so with probability one A and B are the same. Now let's say I want to do Gibbs sampling and I just want to compute the probability of A. Quick question: what's the probability of A = 1? It should be 50/50, right? And note there's no evidence here; I'm not conditioning on anything. So rejection sampling will nail this, because it generates from scratch every time and works perfectly. There's no rejection because there's no condition.

### [55:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=3300s) Segment 12 (55:00 - 60:00)

Okay. So it's going to have no problem detecting that P(A = 1) is 0.5. But what happens with Gibbs sampling? Gibbs sampling will get stuck. The reason is this: let's say I start out with (0, 0). Now I try to sample A, and what happens? Well, B is zero, and we're going to consider the possible values of A, zero or one. If A is one, oops, that has zero probability, so I have to choose zero. And the same with B: if I try to sample B, it always sticks at the same value. So basically Gibbs sampling will get stuck completely at (0, 0) and not be able to move, and the estimated probability of A will put probability one on zero, which is the incorrect answer. This is an instance where A and B are as correlated as possible: they're identical. And in that case, because Gibbs sampling only tries to change one variable at a time, it can't move, because the other one needs to change as well. It's kind of like doing a three-legged race with someone: you're attached to them, and you can't really move until the other person moves. And even if these weren't exactly identical, if this probability were 0.9999, that would also be pretty bad, because then it would be very unlikely for Gibbs sampling to move to a different value. Okay, so we saw complementary examples which are bad for Gibbs sampling and rejection sampling. For rejection sampling, rare events are hard; for Gibbs sampling, highly correlated variables are hard. And the bad news is that in most real-world problems both are true, so it's going to be hard for both Gibbs sampling and rejection sampling, but it's helpful to know the reasons why a problem is hard. So, a few remarks. Gibbs sampling, as I mentioned, is an example of an MCMC algorithm.
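The stuck behavior above can be demonstrated in a few lines. This is a minimal sketch of the two-variable network (fair coin A, B = A deterministically), not the lecture's actual code; because every Gibbs conditional is degenerate, the chain never leaves its initialization.

```python
import random

# Two perfectly correlated variables: A is a fair coin and B = A.
def conditional_weight(value, other):
    # Proportional to P(A=value) * P(B=other | A=value):
    # zero unless value == other.
    return 0.5 * (1.0 if value == other else 0.0)

def gibbs(num_iters=1000, init=(0, 0)):
    a, b = init
    counts = {0: 0, 1: 0}
    for _ in range(num_iters):
        # Resample A given B: only a == b has nonzero probability.
        w = [conditional_weight(v, b) for v in (0, 1)]
        a = 0 if random.random() < w[0] / sum(w) else 1
        # Resample B given A: same story, so B snaps back to A.
        w = [conditional_weight(v, a) for v in (0, 1)]
        b = 0 if random.random() < w[0] / sum(w) else 1
        counts[a] += 1
    return counts

print(gibbs())  # {0: 1000, 1: 0}: stuck at the init, not the true 50/50
```

Rejection sampling on the same network, by contrast, would sample A fresh each time and immediately recover the 50/50 answer.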
There's a more general MCMC algorithm called Metropolis-Hastings, which uses a proposal distribution to more smartly guide the search. And there's actually some really beautiful theory around Markov chains and analyzing how fast they converge; it has to do with this thing called mixing times. In practice, Gibbs sampling is fairly commonly used because it's fairly simple and effective. The only simpler thing is rejection sampling, which generally doesn't get used because the probability of hitting the evidence is generally pretty low. But Gibbs sampling can also be quite slow. So now I'm going to switch gears and talk about conditional independence. This is going to be kind of different. So far we've focused on probabilistic inference algorithms in Bayesian networks, and they've been mostly agnostic to the structure of the Bayesian network: in rejection sampling you just sample all the variables and then reject; it doesn't matter what depends on what. With Gibbs sampling we paid attention to the structure a little bit: we defined the Markov blanket, so when you're computing the probability of a variable, you're looking at the things around it. But there are some really interesting, deeper structures in Bayesian networks, so let's try to explore them a bit. In particular, a Bayesian network is simultaneously a probabilistic object, a distribution over a bunch of variables, and also a combinatorial object, a graph. And there's a connection between the graph properties and the probability properties, in particular when we talk about independence. Okay, so hopefully you've seen independence in your probability

### [1:00:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=3600s) Segment 13 (60:00 - 65:00)

class. But just to review: two random variables are independent if and only if the probability of A and B is the same as the probability of A times the probability of B, for all possible values of A and B. Okay, so that's the definition. Let's look at some examples where A and B correspond to variables in Bayesian networks. Example one: I'm going to try to do some of these on the board. This is the world's second simplest Bayesian network; the simplest one is, I guess, one variable. This corresponds to the joint probability being just P(A) × P(B). In this one, A and B are independent; if you pattern match, that's exactly the definition of independence. And notably, A and B are not connected. So independence corresponds, in this case, to two nodes which are disconnected. If you see a Bayesian network with two variables that are not connected, you can reasonably say that they're independent. We'll see later that it gets a little more complicated, but in this case that's generally the right intuition to have. Okay. So what about if I put an arrow between them? That means the probability of A and B is P(A) × P(B | A). In this case, A and B are not independent, because B depends on A in some way. And here they're connected: intuitively, they're kind of connected and depend on each other. Okay. So now let's look at a different graph. Let me just ask a question: are A and B (this is the independence symbol, A ⊥ B) independent here? Let's try to figure this out. Okay.
So the joint distribution over A, B, and C is the product of the local conditional distributions, and I'm going to look at the probability of A and B, the marginal. That's the summation over C of the joint, so the summation over C of the local conditional distributions. And notice that I can move the summation over C onto the last factor, and that summation gives one. Let me make sure this is clear, because it's an important point. We have the summation over c of P(A) × P(B) × P(C = c | A, B). This is equal to P(A) × P(B) × the summation over c of P(C = c | A, B). And whenever you sum a local conditional distribution over all the values on its left-hand side, what do you get? One. So therefore this is just P(A) × P(B), and A and B are actually independent. Okay. So this is the part that's a little bit tricky about Bayesian networks: you can't only look at whether two variables are connected or not. But this is really the only special case you have to think about. Intuitively, we actually saw this before: we had burglary, earthquake, and alarm, and we said that burglary and earthquake are independent. It just happens that there's something that depends on both. You can have two independent things, and then someone creates a variable that depends on those things; that shouldn't change anything about the independence of the underlying causes. So mathematically you can see it this way, but intuitively you can

### [1:05:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=3900s) Segment 14 (65:00 - 70:00)

see why it might be true. Okay, so that's an important case, so let me just write it down: yes, A and B are independent. Okay. So here is another case: C pointing to both A and B. Are A and B independent? How many of you think yes? How many think no? Okay, the answer is no. So don't be fooled: it looks kind of like the last one; if you look at it upside down they look the same, but the direction matters. These things visually look very similar, but they're almost completely different Bayesian networks. The previous one has P(C | A, B), and this one has P(A | C) × P(B | C). The directionality matters. If you look at the joint distribution, it's P(C) × P(A | C) × P(B | C), and the joint probability of A and B is the summation over c of P(C = c) × P(A | C = c) × P(B | C = c). It's just kind of a mess; there's no way to simplify that. So they're not independent in general. Sometimes, when you see a structure, it can be useful to think about special cases in your head. For example, say C is a coin flip and A = C and B = C. Clearly they're not independent, because A is literally equal to B. So you can work out simple examples like this in your head to convince yourself when things might not be independent. Okay, so that's independence; this one is definitely a no. Now let's look at conditional independence. This is also a familiar concept from probability: we say that two variables A and B are conditionally independent given C equal to some particular value if, basically, it's the same definition except you condition on C in all the terms. Okay, so let's look at some examples. Let's revisit this example here, example four.
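Before moving on, the two marginal claims above can be checked by brute-force enumeration: in the v-structure A → C ← B, summing out C leaves P(a) P(b), while in the common-cause coin-flip special case (A = B = C) the marginal of (A, B) is nothing like a product. The CPT numbers for the v-structure are illustrative assumptions.

```python
from itertools import product

# Example 3, the v-structure A -> C <- B: joint = P(A) P(B) P(C | A, B).
# CPT numbers are illustrative.
p_a = {0: 0.6, 1: 0.4}
p_b = {0: 0.3, 1: 0.7}
p_c_given_ab = {(a, b): {0: 0.5 + 0.1 * (a - b), 1: 0.5 - 0.1 * (a - b)}
                for a, b in product((0, 1), repeat=2)}

def marginal_ab(joint):
    """P(A, B) obtained by summing the joint over C."""
    return {(a, b): sum(joint(a, b, c) for c in (0, 1))
            for a, b in product((0, 1), repeat=2)}

# Summing out C leaves P(a) P(b): A and B are (marginally) independent.
m = marginal_ab(lambda a, b, c: p_a[a] * p_b[b] * p_c_given_ab[(a, b)][c])
print(all(abs(p - p_a[a] * p_b[b]) < 1e-12 for (a, b), p in m.items()))  # True

# Example 4 with C a fair coin and A = B = C (the special case from the text).
def joint_cc(a, b, c):
    return 0.5 * (1.0 if a == c else 0.0) * (1.0 if b == c else 0.0)

p_ab = marginal_ab(joint_cc)
print(p_ab[(0, 0)])  # 0.5, but P(A=0) P(B=0) would be 0.25: not independent
```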
So the question is: are A and B conditionally independent given C? Let's look at the math. The conditional distribution is simply the joint divided by the probability of C, and that's equal to P(A | C) × P(B | C), which is literally the definition of conditional independence. So here A and B are conditionally independent given C. And intuitively, you can think about it this way: there's a coin flip, and then A does some stuff and B does some stuff, and they each only depend on that coin flip. So conditioned on knowing the coin flip, A and B are doing independent things. Okay. Now let's revisit example three. Here I want to ask: are A and B independent given C? Remember, A and B are marginally independent. What about given C? How many of you think they're still independent? They're not. Okay. So hopefully you're anchored to the burglary-earthquake-alarm example. And you see in the math that this whole thing is just kind of a mess: C and A and B are all combined in this arbitrary

### [1:10:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=4200s) Segment 15 (70:00 - 75:00)

function here. So it's definitely not the case that A and B are conditionally independent given C. So this one is a no. These two examples are ones you should really remember, because they look very similar and yet behave differently. In the one above, A and B are marginally independent but not conditionally independent, whereas it's the opposite here: A and B are marginally not independent, but when you condition on C they become independent. This is a good primitive to have in your head when you're trying to reason about these things. Okay. So let's now go to the alarm example; hopefully this is old news by now. Here B and E are independent, but not conditionally independent given A = 1, which is how you get explaining away, which we saw last lecture. Okay. So in general, how do you tell? Suppose I give you a huge Bayesian network and some variables. Here is a general algorithm for deciding: are A and B independent given C? First, you shade in the variables C. You can think of it this way: when you condition on something, it's kind of like blocking a path; that's the intuition. Second, you recursively remove any non-shaded leaves. This deals with the fact that A and B might really be independent, but there are some things hanging off them that make it look like they're not, so you just remove those. You don't have to literally remove them from the picture, but in your head you think of them as not being there. Third, you connect the parents of each node to each other; you can remember it as "marrying the parents". And finally, the answer is: return whether there's a path from A to B that doesn't go through any shaded nodes. Okay, let me do a quick example. I have this medical diagnosis example from last time.
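The four-step procedure can be sketched directly in code. This is a rough implementation of the check as described above, not a full library routine; the single-letter node names (C = cold, A = allergies, H = cough, I = itchy eyes) are my shorthand for the medical-diagnosis network.

```python
# A sketch of the four-step check: shade, prune unshaded leaves,
# marry parents, then look for a path avoiding shaded nodes.
# `parents` maps node -> list of parents.
def independent(a, b, shaded, parents):
    nodes = set(parents)

    # Step 2: recursively remove unshaded leaves (never the query nodes).
    changed = True
    while changed:
        changed = False
        for v in list(nodes):
            is_leaf = not any(v in parents[u] for u in nodes)
            if is_leaf and v not in shaded and v not in (a, b):
                nodes.discard(v)
                changed = True

    # Step 3: build undirected edges, "marrying" each node's parents.
    edges = {v: set() for v in nodes}
    for v in nodes:
        ps = [p for p in parents[v] if p in nodes]
        for p in ps:
            edges[v].add(p); edges[p].add(v)
        for i, p in enumerate(ps):            # co-parents become neighbors
            for q in ps[i + 1:]:
                edges[p].add(q); edges[q].add(p)

    # Step 4: search for a path from a to b avoiding shaded nodes.
    frontier, seen = [a], {a}
    while frontier:
        v = frontier.pop()
        if v == b:
            return False                      # connected => not independent
        for u in edges[v]:
            if u not in seen and u not in shaded:
                seen.add(u); frontier.append(u)
    return True                               # no path => independent

# Cold -> Cough <- Allergies -> ItchyEyes.
parents = {"C": [], "A": [], "H": ["C", "A"], "I": ["A"]}
print(independent("C", "A", set(), parents))   # True
print(independent("C", "I", set(), parents))   # True
print(independent("C", "A", {"H"}, parents))   # False: explaining away
```

The last query shows why marrying parents matters: shading the cough node H connects its parents C and A, which is exactly the explaining-away effect.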
So we have cold, allergies, cough, and itchy eyes. I can go through a bunch of examples. Are C and A independent? The answer is yes: there's nothing to condition on, so I can remove the cough node and the itchy-eyes node because they're unshaded leaves, and then I'm just left with C and A, and there's no path between them, so they're independent. What about C and I? Again, I can't remove I because it's part of the query, but I can remove H because it's a leaf that's not shaded. Then I basically get C and the subgraph A → I, and there's no path between C and I, therefore they're independent. What about C and I given A = 1 (or A = 0, whatever)? If I shade A, these are still independent, because shading A essentially blocks off the way to get over there. And similarly, you can work through what happens if I shade H instead, remembering that shading a node also marries its parents. Okay, maybe we'll do more of these examples next time, but hopefully that's a taste of how to think about conditional independence in the context of Bayesian networks. So, a summary. We talked about probabilistic inference, the ability to answer queries against this database, the Bayesian network. We can do exact probabilistic inference: we just form the joint distribution, and we can marginalize and condition. This is exact, but its complexity can be exponential in the number of nodes in the

### [1:15:00](https://www.youtube.com/watch?v=Dk7Kqqehzjk&t=4500s) Segment 16 (75:00 - 75:00)

Bayesian network. Then we looked at two approximate probabilistic inference methods: rejection sampling and Gibbs sampling. Remember, they have kind of complementary strengths and weaknesses. And then finally we started looking at conditional independence, which assesses whether two variables are independent given evidence, and we were essentially able to reduce that to graph properties: are these two variables connected after a bunch of processing on the graph? Okay, so next time: so far we've just assumed that the Bayesian network is given and we're answering questions. Next time we're going to talk about how you actually learn the parameters of the Bayesian network. How do you get these probability tables?

---
*Source: https://ekstraktznaniy.ru/video/20908*