# Inference for one mean: Worked examples and p-value interpretation | Full lecture (Intro Stats)

## Metadata

- **Channel:** jbstatistics
- **YouTube:** https://www.youtube.com/watch?v=sHenEwSGgLo
- **Date:** 01.03.2026
- **Duration:** 50:26
- **Views:** 1,185
- **Source:** https://ekstraktznaniy.ru/video/52878

## Description

In this full lecture-style video I work through two real-world examples of hypothesis tests for a single mean. The first example uses a fixed significance level approach; I then discuss why that is often not the ideal way to report results, motivating a transition to reporting and interpreting p-values directly. Between the examples I discuss the behaviour of the p-value under the null and alternative hypotheses and give rough guidelines for interpreting the strength of evidence for various p-value ranges.

The second example illustrates drawing conclusions without a fixed significance level. In both examples I also discuss the relationship between the hypothesis test and the 95% confidence interval.

The examples use real data: a study on melatonin onset delay caused by blue light exposure through closed eyelids, and sodium content in fast food chicken nuggets. References below.

Students in my STAT*2040 course at the University of Guelph have the accompanying lecture materials. I wi

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hello everybody. Today we're going to work through two examples of t-tests on a single mean. We've already discussed the foundation for all of these tests, and now we're going to look at examples. In this example, we have a study investigating possible effects of pulses of blue light through closed eyelids on melatonin suppression. See the description for the full citation, but this is real data from a real study. In the study, 16 subjects had their dim light melatonin onset measured in dark conditions one night and with blue light pulses through closed eyelids on another night, and the phase shift was recorded. That phase shift is the difference between those two measurements, and a negative phase shift indicates a delay in melatonin onset under blue light exposure: your melatonin onset was pushed back, thereby possibly impacting your sleep. We'll look at the data in a minute, but first we want to set up a test, and there's a very reasonable test we might want to carry out here. Big picture, we want to investigate whether these blue light pulses do anything to melatonin onset, that is, what happens to the phase shift under blue light pulses. Casually phrased, the null hypothesis would be that nothing happens: those blue light pulses aren't doing anything. The alternative hypothesis would be that they're doing something. We want to see if we have evidence that these blue light pulses are doing something, but we have to be a little more specific about all of that. We're going to let mu equal the true mean phase shift, and I think a reasonable null hypothesis to test is that mu is zero: the true mean phase shift is zero.
Our alternative hypothesis is going to be that the null is wrong. We can phrase this in various ways, but typically a two-sided alternative is more appropriate here. We're going to test the two-sided alternative hypothesis that mu is not equal to zero, and that will allow us to detect a difference in either direction, whether mu is greater than zero or less than zero. The researchers probably suspected going in that if blue light pulses through closed eyelids did anything, they would push back the melatonin onset and thereby tend to make the phase shift negative, because they have some idea of what goes on with these blue light pulses. But we would be interested in a difference in either direction, or at least I would argue that we would be very interested in a difference in either direction. If we saw something counterintuitive on the opposite side, we'd want to publish that. That would be interesting. So a two-sided alternative will allow us to detect a difference in either direction, and I think that is the most appropriate one. Let's say we're going to do a typical thing and carry out this test at an alpha level of 0.05. Suppose we're writing this up for a journal article or thesis and your advisor wants you to just straight-up do the testing at an alpha level of 0.05. This is of course an arbitrary choice, and we'll talk about how to do a test without a specific alpha level in the second example, but for this one let's say we think an alpha level of 0.05 is reasonable, and that is what we are going to do. We're setting all of this up before even looking at our data. Remember, you should be able to construct hypotheses before ever looking at your data, from the nature of the problem that we're investigating. Here, we're trying to see if these blue light pulses do anything in terms of melatonin onset.
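
The setup described so far can be written compactly. This is only a restatement of the spoken hypotheses in symbols, with $\mu$ the true mean phase shift in minutes. The notation $H_a$ for the alternative is an assumption, since the transcript does not show the slide.

```latex
H_0:\ \mu = 0
\qquad \text{versus} \qquad
H_a:\ \mu \neq 0,
\qquad \alpha = 0.05
```
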
So, let's have a look at the data. Here we have my plots, plots that I very much like: a jittered box plot on the left and a normal quantile-quantile plot on the right for assessing normality. First, let's look at the jittered box plot, which is a regular box plot with the points overlaid. The values on the vertical axis are the phase shifts in minutes, and we have 16 of them. The smallest one, down here, is minus 85 or so, and the biggest one, up here, is in the 30s. There are 16 values because we had one observation of phase shift on each of the 16 individuals. I also put in the hypothesized value of zero as a little blue dotted line. Looking at this, I think we can see right away that there looks to be some evidence against the null hypothesis. Visually, to me, it looks like there was something systematic going on here, that this isn't just random variability at work, and that these values have a tendency to be negative. Just visually and casually, that's what it looks

### Segment 2 (05:00 - 10:00) [5:00]

like here. Also on these box plots, I put in a little red dash representing the sample mean. The difference between the sample mean and the hypothesized mean plays a big role in our test statistic, and overall we're seeing whether this is a statistically significant difference. Visually it looks like there's something going on, and that's one major thing to take away from this box plot. Our hypothesized value is at zero, a lot of these values are negative, and there looks to be a substantial difference between the sample mean and the hypothesized mean. So I think there's going to be substantial evidence against the null hypothesis, giving us evidence that the true mean is in fact less than zero, but let's see what the actual hypothesis test has to say. We only have 16 observations. The normality assumption of the t procedure is very important for 16 observations, so we should investigate that. From the box plot itself, things are looking pretty good. Box plots don't speak directly to normality, but things are looking pretty symmetric here, and the big enemies of the t procedure are strong skewness and outliers, and we don't have anything like that here. So from the box plot it's looking pretty good, but the normal quantile plot is designed to help us investigate that a little better. Recall that if we are sampling from a normally distributed population, then the points on a normal quantile plot will tend to fall along a straight line. There's some random deviation, of course, even when we're sampling from a perfectly normal population. It takes some experience to get accustomed to these plots and see how much deviation there needs to be before we can start saying, "Normality is questionable here."
But in this spot, the points fall very close to the line. This is a very nice-looking normal quantile plot, in the sense that our sample data look approximately normal. It's reasonable to think that our population might be at least approximately normal. So the normality assumption of the t procedure looks pretty reasonable here, and we can go ahead and carry out the test. The sample mean phase shift was -30.7 minutes. I'm leaving that to one decimal place because it's exactly -30.70000. I'm giving the sample standard deviation to four decimal places, just to remind us we should be carrying a lot of decimal places through our calculations. This is the sample standard deviation, the standard deviation of the 16 phase shifts that we had up here. We don't know sigma, the population standard deviation of the phase shifts, so we can't carry out a z-test here. We are going to use the sample standard deviation in the test statistic, which means we're going to be using a t-test. So let's go ahead and see what happens. Again, this little red dash on the box plot is the sample mean, so this difference here plays a big role in our test statistic, along with the standard error of the sample mean. Let's work through the value of the test statistic. I wrote it all out like this at first to give us a little foreshadowing of what is to come, because a lot of test statistics are of this same general form. We have our estimator of the population mean mu, which is X bar.
Then we subtract off the hypothesized value of the population mean from our null hypothesis, which is zero in this case. In general, we subtract off the hypothesized value of the parameter we're testing, and then we divide by the standard error of the estimator. This is a form that we see a lot in statistics, in many different places. In this case, we have X bar minus mu naught, and the standard error of X bar is s over root n. At this point we can just put in our values: the sample mean minus zero, over s over root n, and we get a value of the test statistic of -4.0656. So that is the value of our t test statistic. Now, what does that do for us? I've plotted over here a t distribution with certain degrees of freedom. The degrees of freedom for a one-sample t-test are n minus one. That's not always the degrees of freedom for a t-test, but it is for a t-test on a single mean. So the degrees of freedom are n minus one, and we had 16 observations. So, 16 minus one, we have 15 degrees of

### Segment 3 (10:00 - 15:00) [10:00]

freedom. I have plotted here the probability density function of a t distribution with 15 degrees of freedom. Why is this relevant to us? Well, if the null hypothesis is true, this is the distribution of our test statistic. We also need the assumptions to be true, like normality. But this is the big concept here: if the null hypothesis is true, this is the distribution of the test statistic, which means that the value we actually got will just be a randomly sampled value from this distribution. So, how hypothesis testing works is that we ask ourselves: is this a normal, everyday value to get from this distribution, or is this a weird value to get from this distribution? The value we got is out here at -4.0656. Visually, that does look like a weird value to get from this distribution, I would say. This is pretty far out in the tail. Had we gotten a value near zero, something in here like -0.4 or 0.32, that would not give us any evidence against the null hypothesis, because that would have been a very ordinary value to get from this distribution. But we got something out here, which does look like a strange value to get from this distribution, a value pretty far out in the tails, thereby giving us evidence against the null hypothesis and in favor of the alternative hypothesis. But we need to quantify how far out in the tail that is. I can't just look at this and say, "Hey, I think that's pretty far out," while you say, "No, I don't think it's that far out." We need to quantify this, and how we do that is through the p-value. Recall, and this is important when we're trying to figure out our p-value, that we're testing the null hypothesis that the true mean phase shift is zero against the alternative that it differs from zero.
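
For reference, the test statistic described above can be written out as follows. The sample standard deviation $s$ is shown on the slide but never read aloud, so it is left symbolic here; only $\bar{x} = -30.7$, $\mu_0 = 0$, $n = 16$ and the resulting value come from the transcript.

```latex
t \;=\; \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  \;=\; \frac{-30.7 - 0}{s / \sqrt{16}}
  \;=\; -4.0656,
\qquad \text{df} = n - 1 = 15
```

Working backwards from these numbers, the standard error $s/\sqrt{n}$ must be roughly $30.7 / 4.0656 \approx 7.55$ minutes.
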
And we were doing all of this, like we said up top, at an alpha level of 0.05. All of that is relevant here. For a two-sided alternative, the p-value is double the tail area beyond the observed value of the test statistic. We're asking, "What's the probability of getting what we got, which was this value, or something farther out?" That's the general idea. So the area to the left of -4.0656 is important to us. But with this two-sided alternative, the corresponding value on the other side of zero would have given us just as much evidence against the null. I went through this argument when I talked about p-values the first time, but the general idea is that for a two-sided alternative, we take the little area in the tail beyond the observed value of the test statistic and double it. So the p-value of this test is double this area, and we're going to need software to find it. We want to find the area to the left of -4.0656 under a t distribution with 15 degrees of freedom and double it, and we are going to use R for that. We find the area to the left of this value by using the `pt` function in R, because `pt` gives us the area to the left of the input value: `pt(-4.0656, 15)`. But we don't just want that area. We need to double it. In the end, for our two-sided alternative, our two-sided p-value is 0.0010, or about one in a thousand. This is a pretty small p-value. Going forward after this example, the vast majority of the time we're going to simply look at the p-value and assess the strength of the evidence against the null hypothesis.
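
The video does this step with R's `pt`. As a rough cross-check that does not depend on R, here is a minimal Python sketch using only the standard library. It integrates the t density numerically with Simpson's rule rather than calling a statistics package, and the function names `t_pdf`, `central_area` and `two_sided_p` are made up for this illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def central_area(t_abs, df, steps=2000):
    """Area under the t density between -t_abs and +t_abs (Simpson's rule)."""
    if t_abs == 0:
        return 0.0
    h = 2 * t_abs / steps  # steps must be even for Simpson's rule
    total = t_pdf(-t_abs, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-t_abs + i * h, df)
    return total * h / 3


def two_sided_p(t_obs, df):
    """Two-sided p-value: the combined area in both tails beyond |t_obs|."""
    return 1.0 - central_area(abs(t_obs), df)


# Values quoted in the lecture: t = -4.0656 with 16 - 1 = 15 degrees of freedom.
p_value = two_sided_p(-4.0656, 15)
left_tail = p_value / 2  # the single tail area that R's pt() would return
print(f"left tail area = {left_tail:.6f}, two-sided p-value = {p_value:.4f}")
```

In R itself the same quantity is `2 * pt(-4.0656, 15)`.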
But here we are doing a test at a fixed alpha level, which we decided on before going through all of this, for whatever reason. Perhaps we're writing it up for a journal and they need us to write it up in this way, or your advisor at work or school wants you to do it this way. With a fixed alpha level, we put the blinders on. Any nuance in our decision-making is out the window and doesn't matter anymore, right? We could say we reject the null hypothesis in favor of the alternative if the p-value is less than or equal to our alpha level. Or we could say the evidence against the null hypothesis is statistically significant if our p-value is less than or equal to our alpha level. In this case we get a p-value that is less than 0.05, so we could say that the evidence against the null hypothesis

### Segment 4 (15:00 - 20:00) [15:00]

is statistically significant. We could also say we reject the null hypothesis at the 5% level of significance, but I hate that terminology in something like this, because in this case we are investigating how much evidence we have of an effect of these blue light pulses. Nothing is actually getting rejected. We're not officially making a binary decision in the real world in a problem like this. So I don't love that "reject the null hypothesis" terminology here, but it is commonly used. I would say it's better to say that we have statistically significant evidence against the null hypothesis. But once again, we want to tie this in to what it actually means, and not just in symbols. What do we have? We have strong evidence, statistically significant at 0.05, against this null hypothesis and in favor of this alternative hypothesis, with mu representing the true mean phase shift. So we can say we have statistically significant evidence, at an alpha level of 0.05, that the true mean phase shift differs from zero. We have statistically significant evidence against the null hypothesis and in favor of the alternative hypothesis. This is fine and true; this is what happened here. But we are always interested in the direction of the difference. The direction of the difference is very important to us. At this point we get into some statistical nuance and some points that are debatable and argued at times. But just saying that there's a difference makes little sense without talking about the direction of the difference. We got a value way out in the left tail because our sample mean was quite a bit less than our hypothesized mean. We can see that when we look at the plot, right?
Because all these values tend to be negative, our sample mean is quite a bit less than our hypothesized value, leading to a test statistic that is negative and far out in the left tail, meaning that we have strong evidence that the true mean phase shift is in fact less than zero. This is an essential component of this. We are essentially always interested in the direction of the difference, and it's important to think of that when we're carrying out these tests. Now, there are a number of ways we can phrase all of this depending on the context, so let's look at a few of them. This is my grid, showing what we can say, or what tends to be said, in different spots, and what these results mean when we're actually carrying out this test. We had strong evidence. It was statistically significant at the 0.05 level that we chose. There was strong evidence that the true mean phase shift is negative. What that meant in the context of the study is that there is strong evidence that those blue light pulses through closed eyelids actually caused a delay in melatonin onset. So, more formally, there is evidence that the true mean phase shift is negative, and a consequence of that is what it meant in practical application. If we're writing this up for a journal article, we might see something like this. Journal articles might talk about the meaning in the discussion, but when they're just talking about the results of the test, they just bang out the test and assume that the reader knows what it means. In a journal article, they very typically say they're going to carry out all testing at a 5% significance level. It's very common in journal article reporting: hey, we're going to test everything at a 5% significance level. And then that allows them to talk of statistical significance.
So here they might say the mean phase shift was significantly less than zero. Additionally, and very wisely, they typically put in the value of the test statistic, often with the degrees of freedom in there as well, and the p-value of the test. They might say it was a two-sided p-value, to give that additional information. They make a statement about whether the evidence is statistically significant, here saying that the mean phase shift is significantly less than zero, with the value of the test statistic and the p-value, but without explaining what significance means. In the discussion they might talk about the broader implications, but when they're first talking about the results, they just sort of bang out the results. Over here, remember, if we're just talking about a difference, that's true. (That was supposed to be an arrow. Let's try and arrow that up.) If we're just talking about a difference, that's fine in a sense and correct, but it's not the information we're really looking for. The direction of the difference is important, and so we really should be speaking in

### Segment 5 (20:00 - 25:00) [20:00]

terms of the direction of the difference, although some argue otherwise. It's the way to go. Okay. Now, casually. I like having these three things: what it actually means mathematically, how an article might phrase it, and then what we might say if we're loosening things up a little bit and not rigidly sticking to the exact things that we can say. There's really strong evidence that exposure to these blue light pulses, even through closed eyelids as in this study, caused a delay in melatonin onset. And that just tends to screw up your sleep. So blue light pulses, even through closed eyelids, can still really screw up your sleep. So watch those screens, especially close to bedtime. Okay, so there are a lot of different ways we can phrase these things. It's important to get this right. Conclusions in hypothesis testing are often botched, so we really want to put some time into understanding what these results actually mean. In the real world, we carry out these tests with software, so let's see what that looks like in R. It looks similar in other types of software as well, but this is what we have in R. We're going to use `t.test` for a test on a single mean. We're also going to use this `t.test` command in R for tests of the null hypothesis that two population means are equal, so `t.test` comes up quite often in R. It's commonly used. Phase shift is a vector of the 16 phase shifts, so this is my data. Mu equals zero is saying, "Hey R, please test the null hypothesis that mu is zero." Zero is the default in R, but it's always good to put it in there to be explicit that we're testing this null hypothesis. And then we can say the alternative.
R doesn't know whether we want to use a two-sided alternative, or less than on one side, or greater than on the other. Two-sided is the default, but I put it in explicitly here that we're testing a two-sided alternative. You can leave that out, and if you do, the default in R is two-sided. So it calculates the value of the test statistic. That's great. That's what we got. The degrees of freedom are always a useful check: 15 degrees of freedom. We knew there were 16 observations, and for a one-sample t-test like this, that meant 15 degrees of freedom. So that looks good. R is saying that it's using the two-sided alternative, and so that's what this p-value is. That p-value is the one we got up top, this p-value of about one in a thousand. R is taking the area to the left of this test statistic under a t distribution with 15 degrees of freedom and doubling it. So R is giving you the p-value for the two-sided alternative. We can do all of this really quickly in R, right? You put it into R, out it comes, and now our job is to interpret the results, and that's very helpful, of course. As a default, R also gives you a very useful 95% confidence interval. This is like we've done all along when we talked about confidence intervals. It's doing it in the standard fashion, finding a 95% confidence interval for mu, the true mean phase shift. So we can be 95% confident that the true mean phase shift lies between these two values. This is also a useful thing to report. Now, there is a very specific relationship between hypothesis testing and confidence intervals. I'll talk about that more formally later, but let's get the gist of it here. First of all, it's very important to recognize that confidence intervals are important in their own right.
There are many sources out there that, when they talk about this relationship between confidence intervals and hypothesis tests, act like that's the reason confidence intervals exist, and nothing could be farther from the truth. We've talked about confidence intervals before. They are very useful in helping us estimate population parameters, giving us a range of plausible values for a parameter. So confidence intervals are important in their own right, and not only because of the relationship I'm about to talk about. Suppose we had gotten this 95% confidence interval and not carried out the hypothesis test. Let's say we hadn't talked about hypothesis testing and we just looked at this 95% confidence interval. We would say, "Well, wait a minute, this entire interval is to the left of zero. Zero is not in this interval." So zero, in a sense, is not really a plausible value of your population mean mu. All the plausible values of mu are negative. So, it really

### Segment 6 (25:00 - 30:00) [25:00]

looks like we have some evidence that mu is negative. Casually, you can think about it like that, but it is even a little more formal than that. That idea does hold when we're talking about this relationship between hypothesis tests and confidence intervals. It holds exactly here and in a lot of other spots, and then there are some other types of procedures where it's not perfectly mathematically the same. What I'm saying is that if we had just found this 95% confidence interval, we would know, just by looking at it, what would happen if we were to carry out a hypothesis test at an alpha level of 0.05. Because it's a 95% interval, the test is at the 0.05 level, and because this confidence interval is two-sided, the hypothesis test is two-sided. Then this direct relationship holds. Zero is not in this interval, so I would know that I would have statistically significant evidence against the null hypothesis at 0.05. In other words, I know that my p-value up here is going to be less than 0.05. Okay? I know that's what's going to happen. So there is that relationship. Had zero been within the interval, then I would know that I'm not going to have statistically significant evidence against the null hypothesis at 0.05, and that my p-value is going to be bigger than 0.05. So there is that direct relationship. We'll talk about it more formally later, but it's a logical type of thing. It flows with how we think about this, right? Just as reasonably intelligent human beings looking at this, the entire interval is over to the left of zero. All the plausible values of mu are negative, thereby giving us evidence against the null hypothesis and evidence that the true mean phase shift is in fact negative.
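
The relationship between the 95% interval and the two-sided test at the 0.05 level can be checked numerically. The sketch below is a reconstruction, not the R output from the video: the interval endpoints are never read aloud, so the standard error is back-calculated from the quoted sample mean and test statistic, and the helper names (`t_cdf`, `t_quantile`, `same_verdict`) are invented for this example. It uses only the Python standard library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """Area to the left of x: 0.5 plus the signed Simpson integral from 0 to x."""
    if x == 0:
        return 0.5
    h = x / steps  # negative when x is negative, so the integral is signed
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(prob, df, lo=-50.0, hi=50.0):
    """Invert t_cdf by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Quoted in the lecture: n = 16, sample mean -30.7, t = -4.0656.
n, x_bar, t_obs = 16, -30.7, -4.0656
df = n - 1
se = x_bar / t_obs  # assumption: standard error recovered as x_bar / t (mu0 = 0)

t_crit = t_quantile(0.975, df)  # two-sided 95% critical value
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se
p_value = 2 * t_cdf(-abs(t_obs), df)


def same_verdict(mu0):
    """True if 'mu0 lies outside the 95% CI' agrees with 'two-sided p < 0.05'."""
    t_stat = (x_bar - mu0) / se
    p = 2 * t_cdf(-abs(t_stat), df)
    outside = mu0 < ci_low or mu0 > ci_high
    return (p < 0.05) == outside


print(f"approximate 95% CI: ({ci_low:.1f}, {ci_high:.1f}), p = {p_value:.4f}")
print(all(same_verdict(m) for m in (-60, -46, -30, -15, -14, 0, 10)))
```

The whole reconstructed interval sits below zero, which matches the lecture's description, and every hypothesized mean tried gives the same verdict from the interval as from the p-value.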
Hypothesis tests are important in their own right as well. If we're carrying out a test at a specific significance level, then the corresponding confidence interval would lead us to the same conclusion, but carrying out the hypothesis test gets us this p-value. And we do like to report the p-value, because the reader can make up their own mind about the strength of the evidence against the null hypothesis, even if we intend to carry out a test at a fixed alpha level. In this case, we did carry out a test at a fixed alpha level: we said alpha is going to be 0.05, let's carry out the test at that fixed level. But what if we don't have a fixed significance level, and we're just thinking about the p-value? How do we handle that? We know how to handle the situation where we have a fixed value of alpha: we can say the evidence against the null hypothesis is statistically significant at that alpha level of significance if our p-value is less than or equal to alpha. We just put the blinders on, we forget that there's any nuance in the world, and we just say, "Statistically significant evidence at 0.05," say, or, "We don't have statistically significant evidence at 0.05." But life is much more nuanced than that, of course. So what do we do when we're not carrying out a test at a fixed level? In the real world, that's really what we do with the blue light pulses. If we're just talking about it and thinking about it, we don't have such a rigid binary decision between the null and the alternative hypothesis. We are assessing the evidence. Hey, it looks like there's a lot of evidence that the blue light pulses cause this delay in melatonin onset. So what do we do? What can we say? First of all, there's this big notion that the smaller the p-value, the stronger the evidence against the null hypothesis. We're talking about just this study, right?
That p-value is just talking about this study and this test, and not any broader information we have about the problem. Also keep in mind that large p-values, like 0.42 or 0.86, give us absolutely no evidence against the null hypothesis. A 0.8 p-value and a 0.7 p-value both give no evidence against the null hypothesis. We'll talk about that a little below. But for smallish p-values, the smaller the p-value, the stronger the evidence against the null hypothesis, all else being equal and talking about that particular study. It might help, and I think it does help, when we talk about and interpret the p-value, if we know its distribution. So let's look at it in a couple of different spots. First, the distribution of the p-value under the null hypothesis. Let's say we're about to carry out an ordinary z-test or t-test like we've discussed so far, or one of the many other hypothesis tests with continuous test statistics that we'll talk about later. This does hold a little bit more generally or a

### Segment 7 (30:00 - 35:00) [30:00]

lot more generally than these z and t tests, but this idea is important here, and we've only talked about z tests and t tests so far. Suppose also that the assumptions are true, whatever the assumptions are, they're all perfectly met, and that the null hypothesis is true. Here is the distribution of the p-value if the null hypothesis is true. It is a continuous uniform distribution between zero and one. Now, it's kind of hard to discuss why this is the case. It's not super ultra advanced mathematics, but it is beyond the scope of this course. So this is one of those rare times where I'm just going to say you've got to trust me on this one. This is what it is. Under the null hypothesis, the p-value has a uniform distribution between 0 and 1, so all intervals of equal length are equally likely to occur. We have the same chance of getting something down here close to 0 as close to a half or close to 1 over here. Your p-value under those conditions is just a random value plucked from between 0 and 1. And on average, then, our p-value is going to be a half if the null hypothesis is true. Now, a natural question: what happens when our null hypothesis is false? Well, the exact distribution of the p-value depends on a lot of stuff then. How far is your hypothesized value from your true value? What is the variance? What is your sample size? But in all of those cases, when your null hypothesis is false, the distribution of the p-value is going to shift, in some ways, towards 0. Here is one specific situation, where I've taken 100,000 simulated p-values in a scenario where the null is false. We're testing a two-sided alternative. The null hypothesis is that mu is 0, but mu in reality is actually 2, with a sample size of 50 and sigma of 10.
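
A simulation along these lines is easy to sketch. The version below is an illustration under stated assumptions, not the code behind the plot in the video: the transcript does not say which test was simulated, so this uses a two-sided z-test with sigma treated as known, draws the sample mean directly from its normal sampling distribution instead of generating 50 observations each time, and runs 20,000 repetitions rather than 100,000 to keep it quick. The names `simulate_p_values` and `share_below` are invented here.

```python
import math
import random
from statistics import NormalDist


def simulate_p_values(true_mu, mu0=0.0, sigma=10.0, n=50, reps=20_000, seed=1):
    """Two-sided z-test p-values for H0: mu = mu0 when the real mean is true_mu."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)   # standard error of the sample mean
    std_normal = NormalDist()   # standard normal, used for tail areas
    p_values = []
    for _ in range(reps):
        x_bar = rng.gauss(true_mu, se)  # sample mean of n normal observations
        z = (x_bar - mu0) / se
        p_values.append(2 * std_normal.cdf(-abs(z)))
    return p_values


def share_below(p_values, cutoff):
    """Fraction of simulated p-values at or below the cutoff."""
    return sum(p <= cutoff for p in p_values) / len(p_values)


null_p = simulate_p_values(true_mu=0.0, seed=1)  # null hypothesis true
alt_p = simulate_p_values(true_mu=2.0, seed=2)   # lecture scenario: mu is really 2

mean_null = sum(null_p) / len(null_p)
print(f"null true:  mean p = {mean_null:.3f}, "
      f"share <= 0.05: {share_below(null_p, 0.05):.3f}")
print(f"null false: mean p = {sum(alt_p) / len(alt_p):.3f}, "
      f"share <= 0.05: {share_below(alt_p, 0.05):.3f}")
```

With the null true, the p-values should average close to one half, with about 5% of them at or below 0.05, as a uniform distribution implies. With mu equal to 2, a noticeably larger share lands near zero.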
So, you change all of those and you're going to change the exact shape of what you're seeing here, but the general idea still holds in that the entire thing starts shifting over here, moving towards 0. We have a tendency to get p-values closer to 0 when the null is false than when it is true. So, when our null hypothesis is false, and depending on the specifics of the situation, we are going to get this tendency to get p-values closer to zero more so than when it is true. And so, those small p-values are more likely under the alternative than they are under the null. And so, small p-values give us evidence against the null hypothesis and in favor of the alternative. Here's a rough guideline, just my take. Others might disagree, and this isn't universally applicable; maybe we have other information that changes how we feel about things. Rough guideline here. So, again, if we're carrying out a test with a fixed significance level alpha, like 0.05, we just have blinders on, we don't care about anything else, it's just a binary decision. But in the real world, we're faced with this sort of thing a lot, where we're assessing the evidence without a fixed significance level. And overall, you start getting p-values down less than one in a thousand, that's starting to be pretty strong evidence against the null hypothesis. If you're getting a little greater than one in a thousand and less than one in a hundred, say, there's a big difference between one in a hundred and one in a thousand in my mind for most situations, but still very strong evidence here. Strong evidence against the null when you get a p-value less than 0.01. If you get between 0.01 and 0.05, well, now you're getting into a territory where it is statistically significant at the commonly chosen significance level of 0.05, but it's not super strong evidence against the null hypothesis. There's a fair bit of difference between 0.05 and 0.01.
0.05 is one in 20, 0.01 is one in a hundred. That is a substantial difference in my mind, but overall, we're in a region where it is statistically significant at the 0.05 level. It's not statistically significant at the 0.01 level. So, we do have some, you know, moderate to strong evidence. We could think of it in those terms. You creep up a little over 0.05, well, you know, maybe a little evidence against the null hypothesis, but it's certainly not very strong by any stretch. You get up over 0.1, now we're really starting to get into the area where we just do not have much there. So, maybe a little bit, maybe a hint of evidence against the null hypothesis

### Segment 8 (35:00 - 40:00) [35:00]

or we might just think of it as none, depending on the context. Once you start getting up over 0.2, I mean, this is really getting into a territory where there's just nothing there. So, you get p-values of 0.28, 0.43, 0.61, and in my mind at least you're thinking there's just no evidence against the null hypothesis. Note that I'm not treating this the same way as that 0.05 cutoff. I'm not making these decisions based on a precise 0.05, saying really strong evidence at 0.05, no evidence at 0.051. That's not it, okay? There's a continuum here and we're trying to say reasonable things about what our test tells us without a fixed significance level. And going forward, we're going to be doing a lot of tests without a fixed significance level alpha, and so we want to assess how much evidence we have against the null hypothesis, and we're going to be basing that on our p-value. Let's look at another example here. The nutrition information published by a popular fast food chain claims that in the US locations a serving of chicken nuggets contains 540 mg of sodium. And suppose, as part of an investigation into nutrition labeling, you want to investigate whether the true mean sodium content differs from the stated value of 540. And so, if we're trying to see whether the true mean sodium content differs from this, then we're going to make our null hypothesis that the true mean sodium content simply is that 540 mg. That's going to be our null hypothesis. And for the alternative hypothesis, we have the choice of the three, right? Mu is less than 540, mu is greater than 540, mu differs from 540. And typically, as I push for, we're going to use a two-sided alternative because I would argue we're interested in a difference in either direction.
If we only cared about a difference in one direction, we could make an argument for putting that in the alternative, but here I would say let's err on the side of a two-sided alternative. This will enable us to see evidence of a difference in either direction. So, our null hypothesis is going to be that the true mean sodium content in servings of chicken nuggets of this type is 540 mg, and our alternative hypothesis is going to be that the true mean sodium content differs from 540 mg. And let's say in this particular scenario, we are not going to carry out a test at a fixed significance level. We're just going to investigate what evidence we have against the null hypothesis. We're going to check that out. No significance level yet. Let's just see what happens and think about this as logical human beings. Okay. So, if we take a look at the data here, this is what we have. A random sample of six servings from this fast-food chain, so we have an n of six, contains these amounts of sodium. We calculated a sample mean and sample standard deviation. And again, I put a number of decimal places in our sample standard deviation here just to plant the seed that we want to be carrying lots of decimal places throughout our calculations. Real world, we do this through software, of course. But if you're hand-cranking this stuff, carry many decimal places throughout your calculations. And we have these six observations here, and I've put in the hypothesized value of 540 right there. Now, our sample size is six. This is a small sample size. There's nothing inherently wrong with small sample sizes. Small sample sizes, of course, do not give us as much information as large sample sizes, but if the assumptions of the procedure are met, it's perfectly valid to carry out a t-test on a small sample size like this. We prefer large sample sizes because they give us more information.
They also give the central limit theorem some time to work and help us out in terms of having x-bar become more approximately normal if you're starting out with a distribution that's not normal in the first place. But there's nothing inherently wrong with carrying out a test on a small sample size. So, let's see what that test tells us here. This particular time, I plotted a dot plot. We only had six observations. I think this might be a little bit easier to see what's going on than with the box plot. I could have plotted a box plot. This time, I just did a dot plot. These are the six observations, and here's our hypothesized value of mu. Overall, visually, I don't think this is showing us much evidence against the null hypothesis. Doesn't look like a lot of evidence against the null hypothesis to me, just visually, but let's see what the test has to say. The normality assumption is very important for such a small sample size. Unfortunately, for such a small sample size, normal quantile plots only tell us so much. But let's have a look at this particular one. And this

### Segment 9 (40:00 - 45:00) [40:00]

particular one, we have the points falling roughly along a straight line. So, that is saying, "Hey, you know, these six data points are roughly normally distributed." And so, maybe it's reasonable to think that our sample of size six is coming from a distribution that is approximately normal. It's a bit of a stretch for such a small sample size. The normal quantile-quantile plot only tells us so much, but I think it's okay. I can give it a check mark because that normal quantile-quantile plot looks pretty good. And I'll say there are no big outliers or weird stuff going on. We should still feel not so great about carrying out a t-test on such a small sample size in terms of normality, but I think it's okay here. Let's see what the test has to say. So, again, we take our estimator of mu, which is x-bar. We subtract off the hypothesized value, which is 540, and we divide by the standard error of x-bar, a type of test statistic that we will see in a lot of different places. If we do all of this, our sample mean minus 540 over s over root n, we get a test statistic of 0.3506. What does that tell us? Well, for that, we go to the appropriate t-distribution. We had a sample size of six, so our degrees of freedom for a one-sample t-test are 6 minus 1, or 5, and so that is what I have plotted here, the PDF of a t-distribution with 5 degrees of freedom. And this test statistic, if the null hypothesis is true, is a random value plucked from this distribution. And we got a value of 0.35, so we got something, you know, somewhere around here. So, 0.35. That's the value of our test statistic. Looks like an ordinary, average value to get from this distribution to me. So, casually speaking, just looking at this now, I'd say I don't think we have any evidence against the null hypothesis at all. That's what it looks like to me. This is a typical value to get from this distribution. This is the distribution of the test statistic under the null.
We got a typical value from this distribution. So, no evidence against the null hypothesis. That's what it looks like to me, but let's shore this up a little bit. The p-value is going to be... well, what is the p-value? Again, we're doing a test of the null hypothesis that mu is 540 against the alternative hypothesis that mu differs from 540. Mu is not equal to 540. So, a two-sided alternative. The p-value is the tail area doubled because we are doing this with a two-sided alternative. It's the probability of getting the value we actually got or something farther out in the tail. So, something farther out to the right of 0.35 under this distribution, or you could think of it as also farther out to the left of minus 0.35. Or in other words, we want to take this area here and double it. So, double this area is our p-value. And to do that in R, well, we need the area to the right of 0.35 under this t-distribution. So, we're going to go to software. However you do it, we need to use software to get this value. So, we want the area to the right, 1 minus pt of 0.35 with the appropriate degrees of freedom, which are 5, and then we want to multiply that area by two, and in the end we get this p-value of 0.74. And from our earlier discussion, that is a big p-value. Okay, that's a big p-value with no evidence against the null hypothesis. None, zilch, zero. We have this typical value to get from this distribution. We get the p-value, and it's this big p-value. I'm not saying none, zilch, zero because it's greater than 0.05, I'm saying it because that's a big p-value. If the null is true, on average your p-value is a half. We got something even bigger than that. This is a big p-value that, in any situation you're looking at, provides no evidence against the null hypothesis whatsoever. So, we have no evidence that the true mean sodium content in servings of this type differs from 540 mg. That doesn't mean it's true.
That most definitely doesn't mean the null hypothesis is true. We are not saying, we are not saying there is strong evidence in favor of the null hypothesis or anything remotely like that. We are simply saying we do not have any evidence against the null hypothesis.
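The p-value step in this example, doubling the area to the right of the test statistic under a t-distribution with 5 degrees of freedom, can be reproduced without R. What follows is a rough Python sketch using only the standard library; the helper names are hypothetical, and the tail area comes from numerical integration of the t density (Simpson's rule) rather than from a statistics package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def two_sided_p_value(t_stat, df):
    """Double the upper-tail area beyond |t_stat|."""
    return 2 * (1 - t_cdf(abs(t_stat), df))

# Test statistic and degrees of freedom quoted in the lecture.
p = two_sided_p_value(0.3506, 5)
print(round(p, 2))  # 0.74
```

This plays the same role as the `1 - pt(...)` calculation described in the transcript, doubled for the two-sided alternative.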

### Segment 10 (45:00 - 50:00) [45:00]

The data we got was consistent with the null hypothesis being true. If the null hypothesis is true, we could have fairly easily gotten something along the lines of what we got. So, we have no evidence against the null hypothesis, but please do not go so far as to say something untrue like we have strong evidence that the null is true. Can't say that. Cannot say that. Please don't say that. We simply don't have any evidence against it. Now, I'm hoping this shows a little bit of why sometimes carrying out a test at a fixed significance level alpha could be a little bit silly. If we were carrying out a test at a fixed significance level alpha, then we'd say we do not have statistically significant evidence against the null at 0.05. But this p-value here is way the heck bigger than that. There's a big difference between 0.051, which is still close to 1 in 20, and this p-value of 0.74, which is no evidence against the null hypothesis. A p-value of 0.051 is not statistically significant at 0.05. If you're doing the binary decision, it's the same decision as a p-value of 0.74. But in reality, as humans living on this earth and trying to say reasonable things based on the given information, a p-value of 0.051 gives, you know, a little bit of evidence against the null hypothesis. And a p-value of 0.74 gives absolutely no evidence against the null hypothesis whatsoever. So, this is one reason why you should always report the p-value. Regardless of how you're carrying out your test or what you think the significance level should be, the reader should be able to make up their own mind based on the p-value. So, even if you are carrying out a test at a fixed significance level alpha, you report the p-value so the reader can make up their own mind. Note also that, big picture-wise, the sampling design is really important.
I've been glossing it over here in both examples because we have so much more to talk about, but the sampling design is very important. How did we get this sample? The sodium content may very well differ between restaurants or regions or between preparers. To really have a fundamental understanding of what information we have in the big picture, we should take a closer look at the sampling design. Those ideas are still important here even though I glossed over them a bit, because we have so much more to talk about with the hypothesis testing and we could spend forever on one example. And again, in the real world, we do this with software. So, let's see what the R output looks like. It would be similar for other statistical software. We are using t.test, and we put the data, the six sodium concentrations, in this variable called nuggets. And this mu is 540. That is to say, "Hey R, please test the null hypothesis that the true mean is 540." Doesn't mean that... This is talking about the null hypothesis, right? The null hypothesis that mu is 540. The alternative is two-sided. We get the test statistic, which is what we came up with, the 5 degrees of freedom, and the p-value that we found there, and so R does its calculations for us and we can focus on interpreting the results. Note also our 95% confidence interval. Once again, included as part of the default output in t.test is this 95% confidence interval, and we can say this is a 95% confidence interval for mu, the true mean sodium content in chicken nugget servings of this type. And so we can be 95% confident that the true mean falls between these two values, and we can note that our hypothesized value falls in that interval.
So, had we not yet carried out the test, if we just had the interval, we would know that when we go ahead and carry out this test, because this hypothesized value of 540 is within this interval, the p-value we get is bigger than 0.05, and that we wouldn't have statistically significant evidence against the null hypothesis at 0.05, because the hypothesized value of 540 was within our confidence interval. Our p-value is a lot bigger than 0.05. Our hypothesized value of 540 is kind of near the middle of this interval, right? That leads to this really quite large p-value here. And so once again, that idea of carrying out the hypothesis test so we can report the p-value and let the reader make up their own mind in terms of evidence against the null hypothesis is really important, even if you are carrying out a test at a fixed significance level alpha. So, there's a lot to all of this hypothesis testing, many places to go wrong. You want to be working through the exercises and getting lots of experience in different situations so we can really clear up any confusions and misconceptions. It's a bit of a tricky thing in science. The basics are very

### Segment 11 (50:00 - 50:00) [50:00]

straightforward, but with the nuances, the devil is in the details, so you've got to put the work in to fully understand what's going on. Okay, next time, a few more subtleties about hypothesis testing. We'll see you then.
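Returning to the confidence-interval point from the nuggets example: the hypothesized value lies inside the 95% interval exactly when the two-sided p-value exceeds 0.05. The interval is x-bar plus or minus a t multiplier times s over root n, so 540 is inside it precisely when the absolute test statistic is at most that multiplier. Here is a small sketch of that equivalence in Python, assuming 5 degrees of freedom as in the example. The helper names are hypothetical, and the test statistics other than 0.3506 are made up purely for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

def t_quantile(prob, df):
    """Invert t_cdf by bisection (intended for 0.5 <= prob < 1)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = 5
t_crit = t_quantile(0.975, df)  # multiplier in the 95% interval: xbar +/- t_crit * s / sqrt(n)

results = []
for t_stat in (0.3506, 2.0, 3.0):  # 0.3506 is from the lecture; 2.0 and 3.0 are hypothetical
    p = 2 * (1 - t_cdf(abs(t_stat), df))
    inside_ci = abs(t_stat) <= t_crit  # equivalent to: hypothesized mean is inside the interval
    results.append((t_stat, p, inside_ci))
    print(f"t = {t_stat}: p = {p:.3f}, hypothesized value inside 95% CI: {inside_ci}")
```

For each test statistic, the "inside the interval" flag agrees with whether the p-value is above 0.05, which is the relationship described in the lecture.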