# Student's t-distribution in Statistics

## Metadata

- **Channel:** Steve Brunton
- **YouTube:** https://www.youtube.com/watch?v=kQoPUR0hQNo

## Contents

### [0:00](https://www.youtube.com/watch?v=kQoPUR0hQNo) Segment 1 (00:00 - 05:00)

Welcome back. Okay, so today I want to tell you about a really important distribution called Student's t-distribution, which is particularly useful when you're doing hypothesis testing but you have a small sample size n. We know that if we have a relatively large sample size n, then by the central limit theorem the sum of a bunch of identical random variables will converge to a normal distribution, but for small n there's a correction that we need, and this is related to Student's t-distribution. I'm going to derive why we use it, and I'm going to show you in Python that as n increases, the t-distribution actually converges to the normal distribution you would expect from the central limit theorem, and that will be useful for hypothesis testing. The t-distribution is useful for lots of other things, but I'm thinking about it as something you need to know if you're doing hypothesis testing with a small n: often, instead of a normally distributed test statistic, you're going to need a t-distributed test statistic, so you're going to want to set up your rejection region based on this distribution instead of a normal distribution if you have a small sample size. I'll also point out that the paper where this distribution was introduced was published under the pseudonym "Student," and that's why the name stuck. Okay, let's jump in, and let's talk about this in the context of a hypothesis-testing problem. Imagine we're testing a simple hypothesis that after some manipulation or treatment the mean of my population has changed; we've looked at this kind of basic hypothesis test before. We're testing the hypothesis that after some manipulation to my system, the new expected value is different from the old expected value: the mean of the system, mu, has changed.

For example, maybe I have a factory outputting parts, and those parts have a certain expected success rate of not failing or not being recalled, and I do something to hopefully improve the yield of my factory. That would be a hypothesis I'd want to test: did that manipulation or modification actually change the mean of the process? So we'd set up a hypothesis test based on the old mean and new data: we collect data from the system after the manipulation and test whether the mean is the same as or different from the previous mean. (I'm just setting up the t-distribution by reminding you of what a hypothesis test looks like.) We collect data x_1, ..., x_n, and this is going to be a small n for the t-distribution to matter, and we compute the sample mean x̄ = (1/n) Σ_{i=1}^{n} x_i. This sample mean is our best guess, from this data set, of what the new mean mu of the system is. The hypothesis we're going to test is: does this data come from a system with the old mean, or do I refute that hypothesis and assert that the data does not come from that distribution, so that the mean mu must have changed? By the central limit theorem, x̄ is going to be distributed as a normal random variable with mean mu and variance σ²/n if the mean did not change.
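As a quick sketch of this setup in Python (the data, the old mean, and the standard deviation below are made-up illustrative values, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, sigma = 10.0, 2.0            # hypothetical "old" mean and true std dev
n = 8                             # deliberately small sample size
x = rng.normal(mu0, sigma, n)     # data collected after the manipulation

xbar = x.mean()                   # sample mean, (1/n) * sum of the x_i
# by the central limit theorem, xbar ~ Normal(mu0, sigma**2 / n)
# if the manipulation did not actually change the mean
print(xbar, sigma**2 / n)
```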

### [5:00](https://www.youtube.com/watch?v=kQoPUR0hQNo&t=300s) Segment 2 (05:00 - 10:00)

We can take our data, compute this sample mean, and see where that mean lives with respect to this normal distribution; we can build rejection regions based on the normal distribution and test our hypothesis of whether or not the mean changed. The null hypothesis H_0 is that the mean did not change. So we set up a test statistic for this hypothesis: (x̄ - μ) / (σ/√n), the new sample mean minus the old mean, divided by the standard error σ/√n (this should be a complete recap for you). This test statistic should be normally distributed with mean zero and standard deviation one, so we can use it for hypothesis testing. The issue here is the following: we probably know μ, because that's the thing we're testing, did the mean change or not. But we might not know the standard deviation σ of the actual underlying true distribution; that is probably unknown. If it is known, just use this formulation. If σ is not known, then we have to do what's called bootstrapping: we replace the unknown standard deviation with the standard deviation of our data, what's called the sample standard deviation. So if we don't know σ, we replace the statistic with (x̄ - μ) / (S_n/√n), where S_n² is the sample variance, the variance of all of my data: S_n² = (1/n) Σ_{i=1}^{n} (x_i - x̄)². This is just the definition of the sample variance (you will often also see the unbiased version, with 1/(n-1) in place of 1/n). So let's just zoom out. We're doing a hypothesis test of whether the mean did or did not change, and by the central limit theorem (we've derived all of this before) this is the test statistic we usually use; in all of my previous lectures we've used this test statistic, the new mean minus the old mean divided by σ/√n. That's normally distributed, so you can do hypothesis testing: from your normal distribution you can literally build a rejection region, say I want a p-value of 0.05, and you can see whether this test statistic lives inside or outside that rejection region. If it lives anywhere else, that means the mean did not change; but if it lives in the rejection region, then the mean did change, probably, within that statistical significance. But this assumes that you know the standard deviation σ of the actual underlying distribution of the system. We often don't have access to that, so we have to estimate σ from the data we collected; that's called a bootstrap estimate, and you replace the true standard deviation σ with a sample standard deviation computed from the sample variance. This new variable is distributed as the t-distribution with n degrees of freedom, t(n); sometimes we say this is t with parameter n degrees of freedom. Okay, that was a lot to introduce the t-distribution, but what it means is: if you don't know the true standard deviation σ, you have to replace it with the bootstrapped sample standard deviation, and that statistic is distributed as a t-distribution.
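Here is a minimal sketch of the two test statistics; the data and the function names are mine, not the video's, and I use the unbiased `ddof=1` sample standard deviation, a common convention:

```python
import numpy as np

def z_statistic(x, mu, sigma):
    """Test statistic when the true standard deviation sigma is known."""
    n = len(x)
    return (np.mean(x) - mu) / (sigma / np.sqrt(n))

def t_statistic(x, mu):
    """Test statistic when sigma is unknown: replace it with the
    sample standard deviation (ddof=1 gives the unbiased version)."""
    n = len(x)
    s = np.std(x, ddof=1)
    return (np.mean(x) - mu) / (s / np.sqrt(n))

x = np.array([10.1, 9.4, 11.2, 10.8, 9.9, 10.5])  # made-up data, n = 6
print(z_statistic(x, mu=10.0, sigma=1.0))
print(t_statistic(x, mu=10.0))
```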

### [10:00](https://www.youtube.com/watch?v=kQoPUR0hQNo&t=600s) Segment 3 (10:00 - 15:00)

Now, proving this is quite challenging; I'll show you roughly why it's true in the next video, why this is a t-distributed random variable. For now, you just have to know that if you use the bootstrapped standard deviation, you actually have a t-distribution, not a normal distribution. The good news is that for moderately large n, like n = 30, 50, 100, anything bigger than about 30, this distribution starts to look so close to a normal distribution that you can just ignore the correction and use our easy normal test statistic up above. But if you have small n and a bootstrapped variance or standard deviation, you have to use the t-distribution. Okay, so now what I'm going to do is show you a Python code that computes the t-distribution for lots of n and plots them against the normal distribution, so you can see the convergence. Then we'll write down the actual probability density for the t-distribution, and then we'll conclude with some parting thoughts. Really, what I want you to know is this: if you're doing hypothesis testing and you have a bootstrapped standard deviation or variance and small n, you need to use the t-distribution. That's the upshot. Okay, let's now plot that the t-distribution converges to the normal for large n. This is a pretty easy Python code: essentially, we generate a bunch of t-distributions with different degrees of freedom, one degree of freedom, then 2, 5, 10, 30, 100, and we plot all of those Student's t-distributions against a standard unit normal, which we think is a good approximation for big n. We run it, and in the resulting plot the white dashed curve at the very top is the standard normal distribution; you can see that, going from dark blue up to lighter yellow, as n increases the t-distribution gets closer and closer to that normal distribution. So again, what this means is that for large n, even if I have to bootstrap the standard deviation or the variance, I'm probably fine using a normal approximation; but for small n there's a big enough difference that I might get into trouble, so I need to use the t-distribution if I have small n and I'm bootstrapping the variance. Good, that was a really quick demo. And so now, some parting thoughts. I should actually write down the probability density for the t-distribution, but first I'll write down the thing that I think is really important, which is how you use this for hypothesis testing. Let's say I have my Gaussian distribution in pink and my t-distribution for a small n in blue; it looks kind of like the Gaussian, but with fatter tails and a lower peak in the middle of the distribution. This is just a rough sketch. If you're doing a hypothesis test where you would normally build, say, a one-sided rejection region on the normal distribution, but you have a small n and you're bootstrapping your variance for your test statistic, then you need to define your rejection region based on this t-distribution, because the rejection region lives out in the tails, and the tails are exactly where these two distributions disagree the most. So for small n and a bootstrapped variance in your test statistic, you have to use the t-distribution for that hypothesis test. (I realize I completely flipped the colors between my two sketches, with pink as the t in one and blue as the t in the other, but you get the idea.) Okay, let's actually write down the PDF for the t-distribution, the probability density function with n degrees of freedom. We'll come back and show roughly why it is the way it is. It's equal to this gamma
function evaluated at (n + 1)/2. Remember, gamma is related to the factorial; specifically, Γ(n) = (n - 1)! for integers n.
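A sketch of the kind of convergence demo described above (this is not the author's actual script; `scipy` and `matplotlib` are assumed, and the off-screen backend is just so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")             # render off-screen; drop this for interactive use
import matplotlib.pyplot as plt
from scipy import stats

t_vals = np.linspace(-5, 5, 500)
dofs = [1, 2, 5, 10, 30, 100]     # degrees of freedom from the video

colors = plt.cm.viridis(np.linspace(0, 1, len(dofs)))
for df, c in zip(dofs, colors):
    plt.plot(t_vals, stats.t.pdf(t_vals, df), color=c, label=f"t, df = {df}")

# standard normal for comparison (the dashed curve in the video)
plt.plot(t_vals, stats.norm.pdf(t_vals), "k--", label="standard normal")
plt.legend()
plt.savefig("t_vs_normal.png")
```

The point of the picture: the df = 1 curve has visibly fatter tails and a lower peak, while df = 100 is nearly indistinguishable from the normal.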

### [15:00](https://www.youtube.com/watch?v=kQoPUR0hQNo&t=900s) Segment 4 (15:00 - 16:00)

But the gamma function generalizes the notion of a factorial to any number: you can take the gamma of π or of 1/2 or whatever, and it generalizes the notion of a factorial. It's a pretty complicated function, defined by an integral and satisfying a recursion, but it comes up all the time in statistics; just remember that it's a function, you plug in a number and you get out another number. So the density is Γ((n + 1)/2) divided by √(nπ) Γ(n/2), times (1 + t²/n)^(-(n + 1)/2). The gamma factors are just numbers; we essentially need them to normalize the probabilities, so that constant scaling factor doesn't really matter. The thing that actually matters is (1 + t²/n)^(-(n + 1)/2). I'm not going to prove this, but if you take the limit as n goes to infinity, this starts to look a heck of a lot like e^(-t²/2); it's going to look a lot like a standard unit normal centered at zero. So with the normalizing factor making sure the probability adds up to one, this is already looking like it will converge to the normal distribution as n goes to infinity. You could actually write this down and prove it; that would be a nice, easy thing you could do. But this is the distribution you have to use for hypothesis testing when you have a bootstrapped variance for your test statistic and you have a small n. Okay, thank you.
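The density can be coded directly from the formula; a small sketch (using log-gamma to avoid the overflow that plain `math.gamma` hits for large n):

```python
import math

def t_pdf(t, n):
    """Student's t density with n degrees of freedom, straight from the formula:
    Gamma((n+1)/2) / (sqrt(n*pi) * Gamma(n/2)) * (1 + t**2/n) ** (-(n+1)/2)."""
    log_coeff = math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    coeff = math.exp(log_coeff) / math.sqrt(n * math.pi)
    return coeff * (1 + t**2 / n) ** (-(n + 1) / 2)

def normal_pdf(t):
    """Standard normal density, the n -> infinity limit of t_pdf."""
    return math.exp(-t**2 / 2) / math.sqrt(2 * math.pi)

# the t density approaches the standard normal as n grows
for n in (1, 10, 1000):
    print(n, t_pdf(0.0, n), normal_pdf(0.0))
```

Note that for n = 1 this reduces to the Cauchy density, whose value at zero is 1/π, while for n = 1000 the value at zero already agrees with the normal density to about three decimal places.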

---
*Source: https://ekstraktznaniy.ru/video/44508*