# [bsr44] Introduction to Biostatistics: Chapter 12  Analysis of Frequencies (part 2/5)

## Metadata

- **Channel:** statisticsmatt
- **YouTube:** https://www.youtube.com/watch?v=NTvzDPbyDN4
- **Date:** 18.05.2026
- **Duration:** 16:15
- **Views:** 43
- **Source:** https://ekstraktznaniy.ru/video/52889

## Description

The videos on this YouTube Channel are not affiliated with The University of Missouri or my role as a professor at the University.

Here's a link for PDFs of certain videos: https://statisticsmatt.gumroad.com. Also note that if a PDF of the video you want is not uploaded yet, please reply in a comment that you'd like me to upload it and I'll do it.

Help this channel to remain great! Donating to Patreon or Paypal can do this! 
https://www.patreon.com/statisticsmatt
https://paypal.me/statisticsmatt

Several playlists are in PDF format and can be purchased at 
https://gumroad.com/statisticsmatt

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

This is an introduction to biostatistics using R. We're in chapter 12, the chi-square distribution and analysis of frequency data, and we're in section three, tests of goodness of fit. Here we're going to see how well a statistical model or distribution fits our observed data. So if we have a categorical variable (technically it could be continuous, but we'll talk more about that in a minute), how well could we model it with a binomial distribution, a Poisson distribution, or some other distribution?

Now, the chi-square goodness-of-fit test tends to work best with categorical data, but you can also trick it into working with a continuous variable. Say we have a continuous variable and we want to see if a normal distribution fits it. There we have to split it up into what are called bins. We basically discretize the continuous variable, find the probability of falling in each bin, and then see how well our observed data fit those. So you can use continuous distributions to see how well they fit your data, but there are better tests when the data are continuous.

So the question is: do the data conform to some theoretical expectation? We're going to use the chi-square goodness-of-fit test. It's designed for categorical data, not continuous, but you can trick it by discretizing, or binning, the continuous variable into intervals, like age ranges or income brackets. It's really better for discrete data, though. You have your observed data, you calculate the expected frequencies based on the theoretical distribution, and then you apply the chi-square formula comparing the observed and the expected. It's the same formula that we used before. And of course we'll do some examples.

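
For reference, the chi-square formula comparing observed and expected frequencies can be written as below. The symbols $O_i$, $E_i$, $p_i$, $k$, and $n$ are notation chosen here for illustration and are not taken from the video.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n\,p_i
$$

Here $O_i$ is the observed count in category $i$, $p_i$ is the probability of category $i$ under the hypothesized distribution, $n$ is the total count, and $k$ is the number of categories.
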
Now, this is a caveat for using continuous data: it has its limitations. There's a loss of information due to binning, because you're discretizing your continuous variable. There's also the arbitrary choice of intervals, so you'll always be asked why you chose those. But there are alternative tests, like the Kolmogorov-Smirnov test, which we'll cover in the next chapter on nonparametrics, and there are other tests you can use.

Okay, so let's do an example. A quality control engineer takes 50 samples of size 13. So the engineer goes in, grabs 13 observations, records the number of defectives among those 13, puts them aside, grabs another 13 observations, records the number of defectives, and does that 50 times. So the total number of observations is 50 times 13. One of the reasons I picked this example is that we have to pay attention to the assumption that the expected cell counts are five or more.

Here's the data that the engineer collected. Of the 50 samples, 10 had zero defects, 24 had one, 10 had two (remember, a sample is 13 observations), and so forth; zero had six or more. We're going to test two hypotheses, one at a time. We're first going to see if we can model this data with a Poisson distribution. So the null is: does the data follow a Poisson distribution or

### Segment 2 (05:00 - 10:00) [5:00]

not? And then the next test we'll do is: does the data follow a binomial? Can we model this data with a binomial distribution? Of course, that's the null, and the alternative is that it doesn't. Well, part of the trick here is: if we're modeling this with a Poisson and we take 50 samples, how many would we expect to see in each category? If X is truly Poisson, there's a parameter, generally called lambda, that governs that Poisson process, and we need it to calculate the expected frequencies. First we have to estimate it. Lambda represents the mean of the process.

So let's do this in R. We enter our data from the table before and take the average number of defects. So we take 0 × 10, 1 × 24, 2 × 10, and so on, add them up, and divide by 50. That's what we're doing here: this vector of 0 to 6 times the number of samples, summed and divided by 50. So the mean number of defects is 1.3, assuming it's a Poisson distribution.

Now we need to calculate the expected frequencies. This is the density of a Poisson, using our estimated lambda, so these are the probabilities that we'll observe 0, 1, 2, 3, 4, 5, or 6 defects. This is a vector, so it calculates all of those. But we have to be careful: the last category is six or more, not exactly six. So we have to go in and change that. We take the probability of observing five or fewer, subtract it from one, and update that last category. To find the expected cell frequencies, we take 50, which is n, times each of these probabilities, and that's how many we would expect in each category: 0, 1, 2, 3, 4, 5, and 6 or more. Now notice that the expected cell frequencies aren't all greater than or equal to five, so the chi-square approximation is not going to be reliable. What we'll do is bin them, or collapse the categories.

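
The expected-frequency step just described can be sketched in a few lines. The video works in R, where `dpois` and `ppois` would play this role, but the actual code is not in the transcript; the version below is a minimal Python sketch using only the standard library. The helper name `poisson_pmf` and the variable names are hypothetical, and the value 1.3 for lambda is taken as reported in the video rather than recomputed from the full data table.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n_samples = 50   # number of samples in the example
lam_hat = 1.3    # estimated mean number of defects, as reported in the video

# Exact probabilities for 0..5 defects, then "6 or more" as the remaining tail
probs = [poisson_pmf(k, lam_hat) for k in range(6)]
probs.append(1.0 - sum(probs))

# Expected cell frequencies: number of samples times each category probability
expected = [n_samples * p for p in probs]

labels = ["0", "1", "2", "3", "4", "5", "6+"]
for label, e in zip(labels, expected):
    flag = "" if e >= 5 else "  <- below 5"
    print(f"{label:>2}: {e:6.2f}{flag}")
```

With these inputs, the categories from 3 defects upward all have expected counts below five, which is what motivates collapsing them.
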
So let's just collapse all these into one category. It seems like it might work. We re-enter the data; I'm calling it X2. The original data is X, and I don't want to overwrite that. There are ways you could program the collapsing, but it's a small data set, so I just re-enter it. So I collapse them all into a "four or more" category. Now we have to calculate the expected cell frequencies. I find the Poisson density for zero, one, two, three, or four defects. But we don't want exactly four, we want four or more. So we calculate the probability of three or fewer, subtract it from one, and update that last category. Here are the probabilities of seeing each of those, assuming a Poisson. The expected cell counts are 50 times each of those, and here they are. Now notice that, again, they are not all five or more. We could run the chi-square goodness-of-fit test and it would give us a number, but it's unreliable when there are expected counts less than five. So let's collapse that into the next category, and that's what we do here. We update the probabilities in a similar fashion, so these are the respective probabilities of zero, one, two, or three or more. The expected cell counts are here, and they're all five or more. So we can

### Segment 3 (10:00 - 15:00) [10:00]

calculate the test statistic, which is the observed frequency minus the expected frequency, squared, divided by the expected frequency, summed over the categories. And there's our test statistic. Now, it looks like I did not calculate a p-value by hand. Here there are one, two, three, four categories, so the degrees of freedom is three, and the statistic is 3.58, I'll call it. We could calculate that p-value by hand, but I don't. I just plug it into the chi-square test function. This is our observed data. Notice that I had to supply the probabilities here, because the probabilities are different for each category. It cranks out the results. Notice the test statistic is the same, and here's the p-value, 0.31. So we do not reject the null. We can use a Poisson to model this data, or at least there's no evidence that we shouldn't.

Now let's look at whether we can model the same data with a binomial distribution. The null hypothesis is that the data follow a binomial distribution. If X is binomial, the two parameters are n and p. Well, n is 13, since we're taking samples of size 13, and p is the probability of a defect, so we need to estimate that. We take the number of samples with zero defects, with one, with two, and so on, multiply each by its number of defects, add them up, and divide by the total number of observations, which is 50 × 13. We get an estimate of 0.1. So the probability that any observation is defective is 0.1, and that's what we plug in here: p-hat is 0.1.

Now we can proceed. We need the expected cell counts, so we need the probability of each category, and we use the binomial distribution. This is n, this is our estimate for p, and these are the categories 0, 1, 2, 3, 4, 5, and 6 or more. But the density gives the exact probability of six, and we want six or more, so we have to update that last one. These are the probabilities. To find the expected cell counts, we take 50 times each of those probabilities, and we get this. But notice many of those are not five or more. Take this one here.

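
The binomial version of the expected-count calculation can be sketched the same way. This is a hedged Python illustration, not the R code from the video (which presumably relies on something like `dbinom`). The helper `binom_pmf` and the variable names are hypothetical. The total of 65 defects is not read from the data table; it is implied by the reported mean of 1.3 defects per sample across 50 samples.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_samples = 50       # number of samples
n_trials = 13        # observations per sample
total_defects = 65   # implied by the reported mean: 1.3 defects x 50 samples

# Estimated probability that a single observation is defective
p_hat = total_defects / (n_samples * n_trials)

# Exact probabilities for 0..5 defects, then "6 or more" as the remaining tail
probs = [binom_pmf(k, n_trials, p_hat) for k in range(6)]
probs.append(1.0 - sum(probs))

# Expected cell counts: number of samples times each category probability
expected = [n_samples * q for q in probs]

for label, e in zip(["0", "1", "2", "3", "4", "5", "6+"], expected):
    print(f"{label:>2}: {e:6.2f}")
```

As with the Poisson fit, the expected count for exactly 3 defects comes out just under five and the higher categories are far smaller.
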
If that were the only one less than five, the 4.99, I'd probably leave it, because it's close enough to five and your results would be okay. But these others are clearly not okay, so we have to combine them. We'd go through the same logic. Maybe we combine the last two, and then we'll find that the expected cell counts are still not all five or more. So then we have to combine with the next one, and we end up combining them all. I'm going to spare you the details, but it's very similar to what we did for the Poisson. So we've collapsed bins, or categories, and we're down to this. Now we have to calculate the expected cell counts. Here's the probability of being in each of those categories. Notice the d function is the density, but we need three or more, not exactly three. So we calculate one minus the probability of being two or less, and we update it. Here are the probabilities. We take each of those probabilities times 50, the number of samples, and these are the expected cell counts. So now we can calculate the test statistic: the observed minus the expected frequency, squared, divided by the expected frequency, summed up. That follows a chi-square distribution, and there are one, two, three, four categories, so the degrees of freedom is three. The statistic is 2.79. So this is the p-value,

### Segment 4 (15:00 - 16:00) [15:00]

but I didn't calculate it by hand; I just used the built-in function, the chi-square test. I plugged in our observed counts and the probabilities that we calculated for each of those categories. Notice the test statistic is the same, as it should be, and the p-value is 0.42. So we do not reject. What that means is that modeling the sample data with either a Poisson or a binomial is acceptable. Okay, and this doesn't prove that the data is from a Poisson or from a binomial. It's just saying there's not enough evidence to use something different, so we could use either of those. Well, we're at 16 minutes. The next video will be on the test of independence, which tests whether two categorical variables are independent of one another.
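
The two goodness-of-fit tests from this video can be reproduced with a short sketch. In R the built-in `chisq.test` with its `p` argument does this; the Python below is a standard-library stand-in, not the author's code. All function names are hypothetical. The observed count of 6 for "3 or more" is derived as 50 − 10 − 24 − 10. The p-value uses the closed-form upper tail of a chi-square distribution with 3 degrees of freedom, matching the degrees of freedom used in the video (number of categories minus one).

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def gof_statistic(observed, probs):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def collapsed_probs(pmf):
    """Probabilities for the categories 0, 1, 2 and '3 or more'."""
    head = [pmf(k) for k in range(3)]
    return head + [1.0 - sum(head)]

# Observed number of samples with 0, 1, 2 and 3+ defects (3+ is 50 - 10 - 24 - 10)
observed = [10, 24, 10, 6]

# Poisson with the estimated mean 1.3; binomial with n = 13 and estimated p = 0.1
pois = collapsed_probs(lambda k: math.exp(-1.3) * 1.3**k / math.factorial(k))
binom = collapsed_probs(lambda k: math.comb(13, k) * 0.1**k * 0.9 ** (13 - k))

for name, probs in [("Poisson", pois), ("binomial", binom)]:
    stat = gof_statistic(observed, probs)
    print(f"{name}: X^2 = {stat:.2f}, p = {chi2_sf_df3(stat):.2f}")
```

Under these assumptions the sketch gives a statistic of about 3.58 with p ≈ 0.31 for the Poisson fit and about 2.79 with p ≈ 0.42 for the binomial fit, in line with the values quoted in the video.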