Density Estimation with Gaussian Mixture Models (GMM) and Empirical Priors




Contents (4 segments)

Segment 1 (00:00 - 05:00)

Welcome back. We're talking about Bayesian inference: the idea that we want to estimate the parameters Theta of some distribution from data X, while also incorporating our prior belief about the distribution of those parameters through a prior function P(Theta). It's a really nice way of balancing partial knowledge of my system against new evidence from data. We've looked at this a lot in the context of probability distributions where we know the likelihood function, where it's a nice, well-behaved distribution like a binomial or a normal, and we can choose a good prior distribution to encapsulate our beliefs so that it still plays nicely with the likelihood function; this is the notion of a conjugate prior, if you remember that. We looked at simple examples like the coin flip, where the likelihood of getting H heads out of N coin flips is a binomial distribution, and a good conjugate prior for the binomial distribution is a beta distribution: we encode our prior beliefs through the parameters of a beta distribution. And we had the nice property that when we multiply our likelihood and our prior together, we get a posterior update that is still in the beta distribution family, so it can become the prior for the next round of data collection. I collect more data X, I update my prior, the result is still a beta distribution, so it becomes the prior for the next round; I collect more data, I update into a new posterior, and so on. That's the basic procedure of Bayesian inference. But up until now we've assumed that all of these distributions belong to nice, well-defined, named families of distributions: binomial, normal, Gaussian, things like that. In most of the real world, the density functions, my P(Theta) or P(Theta | X) or any of these, are going to be pretty nasty, poorly behaved functions that don't have names; they're not going to be normal or binomial
for lots of the cases we care about, especially in machine learning. So what we're going to talk about today is the idea of density estimation: instead of assigning these to some family of named distributions, we're going to estimate these probability densities, essentially using a fancy curve fit. We're going to approximate empirical prior functions and empirical posterior functions by estimating their PDFs from data, essentially by summing up a few basic, simple distributions, usually Gaussians, to approximate these density functions.

So what do I mean? Let's say my density is really nasty, some kind of funky curve. There's no named function that will give me that, where I have a nice conjugate prior, so instead we have to use an empirical prior, using density estimation to approximate this PDF, and then do our Bayesian update using these approximated PDFs. I'm going to show you some code in a minute for a very specific case, but here I just want to name two of the most common families of density estimation tools; there are a ton, but these are two of the really common ones.

One of them is called kernel density estimation, KDE for short. Kernel density estimation essentially treats every data point in X like a little collocation point that gets its own little Gaussian, its own little kernel, and you smooth between those to approximate a smooth distribution from a bunch of data. The way we actually do this, we say that the probability of my random variable X taking some value x is

    P(X = x) = (1 / (n h)) * sum_{i=1}^{n} K((x - x_i) / h)

where K is my kernel, usually a Gaussian. Generally speaking, Gaussian functions are the easiest to work with, since Gaussians are their own conjugate prior.
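The kernel density estimator written out above is easy to sketch directly in NumPy. This is an illustrative sketch, not the lecture's code: the two-cluster data, the bandwidth h = 0.3, and the evaluation grid are made-up choices.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal PDF, used as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    """P(X = x) = 1/(n*h) * sum_i K((x - x_i)/h): one kernel per data point."""
    x = np.atleast_1d(x)
    u = (x[:, None] - samples[None, :]) / h   # distance to every past sample
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

# illustrative data: a lumpy, unnamed distribution (two clusters)
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

grid = np.linspace(-4, 6, 500)
density = kde(grid, samples, h=0.3)

# a valid PDF: non-negative, and it integrates to ~1 over the grid
print(density.sum() * (grid[1] - grid[0]))
```

The bandwidth h plays exactly the smoothing role described in the lecture: tiny h gives a spiky estimate with one bump per data point, large h washes all the structure out.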

Segment 2 (05:00 - 10:00)

conjugate prior: a Gaussian likelihood and a Gaussian prior work well together, so if I can add up a bunch of Gaussians to approximate these functions, that's going to work nicely too. So usually the kernel is a Gaussian; it doesn't have to be, but it often is. H is a smoothing width, essentially a hyperparameter you get to choose to smooth out your distribution; this is a lot like diffusion maps, where there's some smoothing factor. And the x_i are all of my actual data samples: I have N samples of data x_i, and to get the probability of some new sample being at some value x, you plug that x in, and the x_i are all data from before; then there's my smoothing factor H and my number of data points N. So this is just a function, basically a sum of a bunch of kernels, usually Gaussians. The way you can think about it is that every time I've seen a data point x_i before, that makes it more likely to see a new point close to that data point in the future; that's kind of what we mean by a frequentist view of probability. If I have a bunch of data clustering around some point, then the likelihood of X being near those points gets higher and higher. This is the probability of finding X near points I've already seen in the past; that's what this really says. Okay, that's what kernel density estimation does.

The other big category, and this is what I'm going to use in my Python example of density estimation in a few minutes, is what we call Gaussian mixture models, GMM for short; you've almost certainly heard of these before. KDE is almost like a special case of a Gaussian mixture where you have as many Gaussians as you have data points; in a GMM we're trying to use many fewer Gaussians to approximate my distribution. The rough picture looks like the following: if my PDF is kind of smooth and kind of bumpy, I can usually approximate it as the sum of a small number of Gaussians, maybe as few as three. In KDE I have as many kernels, or Gaussians, as I have data points, so if I do 100 coin flips I'd have 100 little Gaussians I'm averaging over; in a Gaussian mixture model I ideally have far fewer Gaussians approximating my probability distribution. This is a really nice, very general class of function approximators, and it generalizes nicely to higher-dimensional parameter spaces Theta. What I drew is a probability over a scalar-valued Theta, but if I had data in a higher-dimensional space, I could still approximate it as the sum of a few Gaussians: Gaussian one, Gaussian two, Gaussian three, and Gaussian four. That's the idea of a Gaussian mixture model.

Now, training these on the data, finding the weights of these Gaussians and their means and variances, finding the specific mixture of Gaussians that fits your data, that's an optimization problem. We might do something like expectation maximization, which has a Bayesian analog and a maximum-likelihood point-estimate analog. So the GMM is a class of functions to approximate my density, and finding the specific number of Gaussians and their parameters is an optimization problem; you solve for the fit with optimization, for example the EM, or expectation-maximization, algorithm, which we'll talk about later. We'll probably have a whole lecture on EM as it relates to Gaussian mixture models, covering the Bayesian version and the non-Bayesian version of
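Fitting a mixture by expectation maximization can be sketched in a few lines of NumPy. This is a hedged, 1-D illustration of the algorithm the lecture defers to a later session, not the lecture's code; the quantile-based initialization, iteration count, and two-bump data are my own arbitrary choices.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_gmm_1d(x, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by expectation-maximization."""
    # crude but deterministic initialization: spread the means over quantiles
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())
    w = np.full(k, 1.0 / k)                        # mixture weights
    for _ in range(n_iter):
        # E-step: responsibility of component j for data point i
        r = w * normal_pdf(x[:, None], mu, sigma)  # shape (n, k)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

# illustrative bimodal data: two well-separated bumps
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
w, mu, sigma = fit_gmm_1d(x, k=2)
print(sorted(mu))  # component means recovered near -2 and +2
```

The contrast with KDE is visible in the code: two Gaussians summarize 600 data points, rather than one kernel per point.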

Segment 3 (10:00 - 15:00)

expectation maximization. There are lots of optimization tools to solve for this, and lots of ways of estimating your density when you don't know closed-form functions; these are just a couple of examples. KDE is kind of nice because I don't really need to do an optimization to find the distribution; the distribution is defined by the data points, and that's a nice property. And there are lots more methods; these are just two common examples.

Okay, so I want to actually give you an example now. We're going to go back to the case where we're flipping a coin, using Bayesian inference to infer the parameter Theta associated with this coin, which is the probability of it coming up heads; a fair coin would have Theta = 0.5. We know there's an easier way to solve this problem, because we know the likelihood is binomial and a good conjugate prior is beta, so we know that's a good way to do the Bayesian inference and update for a coin-flip process. But let's pretend we don't know the likelihood function and the conjugate prior, and see if we can get the same kind of answer using density estimation, this much more general tool that works when I don't know these distributions. Let's try it on a case where I know the answer, this coin probability. Good.

So we're going to fire up our Python example. This is very similar to the code I've shown you before, except now we're doing, sorry, not a Gaussian mixture model, a kernel density estimation. Here we're doing kernel density estimation using a Gaussian kernel, and this bandwidth parameter is related to H; it controls how smooth you make everything. We're going to estimate these distributions using our kernel density estimate. So instead of binomial and beta, we're going to use this KDE approximation to these distributions, and there's going to be an update related to it: you essentially update your kernel density estimate every time you get a new data point by modifying this function. That's a really nice, easy way of encoding your prior: literally just updating this representation of the prior by adding more data points.

So we're going to run this and see what it looks like. The dark purple curves are the earlier coin flips, with less evidence X, and every time I flip a coin I gain new evidence and we update this kernel density estimate. It starts off being pretty wonky, but with more and more evidence you see that the posterior is getting closer and closer to a unimodal distribution with a peak at Theta = 0.5. It's not perfect, but it's getting pretty close; the posterior mean is pretty close to a fair coin at 0.5. It's pretty wobbly and kind of ugly, but if you do this procedure with more data, if I flip 100 coins, it gets better. And again, the smoothing parameter matters a lot, so I'm going to ask you to take this code and play around with that smoothing parameter; I think here it's 0.5. Try 0.05, try 0.01, try 99, and see how it breaks or how it works.

If we have 100 coin flips, and this was just eight coin flips, it's a slightly more complicated code because I'm doing some cool plotting, but it's the exact same KDE approximation and very simple update to the posterior, and now we can use this nice movie to watch our P(Theta) update as we acquire more and more evidence X. Here orange is the distribution, in blue we see the mean of the distribution plus or minus one standard deviation, and down here we're plotting the posterior mean with its standard deviation. Very quickly it starts to look like a nice, well-behaved, almost unimodal distribution, because all of these heads and tails are averaging out and giving us a nice balance. If we have a big enough N, this smooths out and you get a pretty decent approximation to a unimodal, single-peak density function with the mean value of Theta around the correct value of 0.5. It's not perfect
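One way to realize a KDE-based Bayesian update like the one described here is a resample-and-smooth scheme: represent the prior by samples of Theta, weight them by the likelihood of each new flip, resample, and smooth the sample cloud with a Gaussian KDE. This is a sketch of the idea, not the lecture's actual code; the sample count, the bandwidth 0.1, and the resampling scheme are my own illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# flat prior over Theta, represented empirically by samples on [0, 1]
theta = rng.uniform(0, 1, 5000)

flips = rng.random(100) < 0.5          # simulate 100 flips of a fair coin

for heads in flips:
    # likelihood of this single flip for each Theta sample
    w = theta if heads else (1.0 - theta)
    w = w / w.sum()
    # Bayesian update: resample Theta in proportion to the likelihood,
    # then smooth the sample cloud with a Gaussian KDE (the bandwidth is
    # the smoothing hyperparameter the lecture says to play with)
    theta = rng.choice(theta, size=theta.size, p=w)
    theta = gaussian_kde(theta, bw_method=0.1).resample(theta.size, seed=rng)[0]
    theta = np.clip(theta, 0.0, 1.0)   # Theta is a probability

print(theta.mean())  # posterior mean should settle near 0.5
```

As in the lecture's demo, try making `bw_method` much smaller or much larger and watch the posterior either fragment into spikes or smear out.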

Segment 4 (15:00 - 18:00)

but it's pretty good. It's not as good as if you know the distribution and its conjugate prior, but if you don't, this is a pretty good approach: using something like kernel density estimation or a Gaussian mixture model for your prior and your posterior. Kind of a cool example, and it's really that simple. You can go through the code and figure out exactly how we updated the prior, sorry, updated the posterior with this evidence, and convince yourself that it all makes sense and is consistent with what we did before, except that instead of known distributions we're now using these empirical distributions for the prior and for the posterior. Pretty cool idea.

What else do I want to tell you? We'll talk more about how to estimate this product when these are even nastier functions that are hard to approximate using these methods; we'll talk about how to do Monte Carlo approximations and bootstrapping, which is really powerful, especially for higher-dimensional parameter spaces and more complex dynamics.

One last point I want to mention, just as a passing note we'll return to later: empirical distributions like what we just showed are essentially generative models. We've been thinking a lot about generative AI; everybody talks about it now, especially because of ChatGPT and all of the LLMs, which are generative models, and before that we were looking at things like DALL-E and Midjourney, which generated beautiful images from a text prompt. Generative AI has made huge advances in the last five or ten years, and empirical distributions, these approximate functions for my probability distribution, essentially are generative models: if I have one of these empirical distributions, I can generate data that follows these PDFs, because I know what the probability of finding data at some value is, so I can use that as a generative process. So empirical distributions are a form of generative model. There are much more sophisticated generative models now, so these aren't exactly equal to each other, but everything I'm doing here allows me to generate new synthetic data from old historical data; I can generate a big ensemble of new synthetic data.

The headline here is that for most problems of interest in machine learning and advanced statistics, these distributions are nasty; they're not going to have names and be well-defined mathematical expressions, so we often have to resort to density estimation to approximate them in terms of simpler basis functions and distributions. Okay, thank you.
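The "empirical densities are generative models" point can be sketched concretely: fit a density to historical data, then draw a fresh synthetic ensemble from it. The bimodal data below is made up for illustration, and `scipy.stats.gaussian_kde` stands in for whatever density estimator you prefer.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# "historical" data from an awkward bimodal distribution with no standard name
data = np.concatenate([rng.normal(-1.0, 0.3, 400), rng.normal(2.0, 0.8, 600)])

# fit the empirical density, then use it as a generative model:
# draw a fresh synthetic ensemble that follows the same PDF
density = gaussian_kde(data)
synthetic = density.resample(10000, seed=1)[0]

# the synthetic ensemble reproduces the statistics of the historical data
print(round(data.mean(), 2), round(synthetic.mean(), 2))
```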

Author: Steve Brunton
