[bsr45] Introduction to Biostatistics: Chapter 12 Analysis of Frequencies (part 3/5)

14:16

statisticsmatt 25.05.2026 53 views

Video description

The videos on this YouTube Channel are not affiliated with The University of Missouri or my role as a professor at the University. Here's a link for pdf's of certain videos. https://statisticsmatt.gumroad.com Also note that if a pdf of the video you are wanting is not uploaded yet, please reply in a comment that you'd like me to upload and I'll do it. Help this channel to remain great! Donating to Patreon or Paypal can do this! https://www.patreon.com/statisticsmatt https://paypal.me/statisticsmatt Several playlists are in PDF format and can be purchased at https://gumroad.com/statisticsmatt

Contents (3 segments)

Segment 1 (00:00 - 05:00)

...the Introduction to Biostatistics using R, and we're in chapter 12, which is the chi-square distribution and the analysis of frequencies, and we're in section four, tests of independence. In a clinical setting, tests of independence are a statistical tool used to determine whether two variables are associated, meaning: do they operate independently of one another? If we have two continuous variables, a lot of times we'll use Pearson's correlation or Spearman's correlation to see if there's a relationship between them. But how do you do that when you have two qualitative, or categorical, variables? Let's say we're in a hospital and we want to know whether smoking status is associated with postsurgical complications. We could do this test of independence to see if those two variables are independent.

So we're going to use the chi-square test of independence. It's very similar to what we've done before, but we do it with contingency tables. In this simple example, we're going to create a 2x2 table. On the left side we'll have smoking status, yes or no, and across the top we'll have complications, yes or no. Then we'll tabulate the observed data and conduct a chi-square test, and we'll develop that as we proceed through this. Now, in this example, smoking had two levels and postsurgical complications had two levels, but they don't have to. They can have as many levels as you want.

So here's the general test of independence of two categorical variables. Just call them generically X1 and X2. X1 has r levels and X2 has c levels, and we create this generic contingency table. The levels for X1 go this way, so there are r levels, and r is used because it also tells you how many rows are in your contingency table. X2 has c levels, and of course c is for columns.
It tells you how many columns your table has. As you collect this bivariate data, you tally the observations. So this one here: there were n12 observations that had level one for variable one and level two for variable two. That's the count, and in the formula notation that's also going to be O, or, to be specific, O12. That's how many observed values we have from our sample in that cell. Perhaps I should have used O instead of n, but I think n is more classically recognized, as in there are n observations in there. But to be consistent with all the other sections, the formula for the chi-square test uses O, which stands for observed.

Well, if we have the observed counts, we need to start thinking about the expected cell counts. Given this table and the null hypothesis that the two variables are independent, how many observations would we expect to see in each cell? That's what we're getting ready to develop. So the probability that we're in category i for variable one and category j for variable two is this formula. To be very specific, if we wanted to estimate p12, the probability of being in that cell, based upon the data, well, there's that many observations that are in that category

Segment 2 (05:00 - 10:00)

out of the total sample size. And so that's going to be our estimate for p12. I put "approximately" there because that's our estimate, and that's what this formula is saying: it's an estimate for that cell probability.

Now, if we only want to look at the probability that you're in this category, well, how many observations fell into that category? It would be this one. Notice the dot in the index. That means we're adding up over that index. The first index is a one; we're adding all of these and putting the result into the row total. So we would estimate the probability of being in category one as that number divided by our sample size. And that's what this is representing: to be in category i, it's the row total, or how many fell into category i, divided by our sample size. That's going to be our estimate. But we could also do this for the categories j. What's the probability we're in category two of variable two? Well, it's that column total divided by the total sample size. And that's what this represents.

Now, as a reminder, if two events A and B are independent, then the probability of A and B, or equivalently A intersect B, can be written as the product of their probabilities. Why am I showing you this? Remember that the null hypothesis is that our two variables are independent. So if they're independent, take this cell probability, and as a reminder, this is the probability that X1 is equal to i and X2 is equal to j. That's what the comma means; I could put an intersection sign there. Well, if these are independent, it's the product of the probability of being in category i times the probability of being in category j. So we can estimate this probability, assuming independence, as the product of those two estimated probabilities.
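
In symbols, the estimates just described can be written as follows. The slide itself is not part of the transcript, so the notation here is a reconstruction from the narration: n_ij is the observed count in cell (i, j), a dot marks an index that has been summed over, and n is the total sample size.

```latex
\hat{p}_{ij} = \frac{n_{ij}}{n}, \qquad
\hat{p}_{i\cdot} = \frac{n_{i\cdot}}{n}, \qquad
\hat{p}_{\cdot j} = \frac{n_{\cdot j}}{n}

% Under the null hypothesis of independence the cell probability factors:
H_0:\; p_{ij} = P(X_1 = i,\; X_2 = j) = p_{i\cdot}\, p_{\cdot j}
\quad\Longrightarrow\quad
p_{ij} \approx \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n}
```
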
Well, that means when we calculate the expected cell frequencies, we take our total sample size times the probability of being in that cell, and that tells us the expected number of observations we would see in that cell. But we can split that into the product of these probabilities, and then we write everything out. That n cancels with this one, leaving this. So this is it. That's the formula for the expected cell counts: it's the row total times the column total divided by the total sample size.

Now, after we've collected the sample and observed the cell frequencies, which we denote by Oij, we calculate the test statistic. It has the same flavor as the previous sections in this chapter: it's the observed cell frequency minus the expected, squared, divided by the expected, and then we have to add those up over every cell. That's what the double sum is, because we're summing over a table. The test statistic follows a chi-square distribution with (r - 1) times (c - 1) degrees of freedom. Then, to calculate the rejection region, I draw the same chi-square distribution for whatever the degrees of freedom are, so this is (r - 1)(c - 1), and we want this upper-tail area to be alpha. That cutoff point we denote by that. If the test statistic is greater than that, we reject the null and conclude that the two variables are not independent. If it doesn't fall in the rejection region, we say that there's not enough evidence to say they're not independent. Right? We never prove that they're independent. There's just no evidence to say they're dependent.

So let's do an example. Let's sample 750 people, and they're classified

Segment 3 (10:00 - 14:00)

according to income and stature. And we want to test: are these two factors independent? Let's test it at the alpha = 0.05 level. So here's our observed data. Income is poor, middle class, or rich, and stature is thin, average, and fat. Technically I probably should say low BMI, average BMI, or high BMI, but it is what it is. We have row totals and column totals. Now we need to calculate the expected cell counts. So this one right here is 200 times 270 divided by 750. Boom. This cell count, and I'm just grabbing one at random, would be this row total times the column total divided by the total sample size. And those are the expected cell frequencies. Well, we want to compare the components. If the observed is close to the expected, well, they're independent, but if they deviate too much, then they're probably not independent. And so that's a reminder of how to calculate these.

So let's do this in R. We've entered our data into a data frame, or actually I entered it into a matrix. We could add margins to the matrix. While that's cool, it's not necessary, but we do need the margins to calculate the expected cell counts. And we need every possible combination: the first row total times each of those column totals, the second times each of these, the third. And there's a cool function called outer which allows us to do that. We take the row sums, which is this, and the column sums. We outer-multiply them, so it takes every possible combination, and we divide by the total sample size, and boom, it creates this matrix of expected cell counts. Well, now we need to go cell by cell: this cell minus this cell, squared, divided by that cell, which is what this is. So these are the components of the test statistic. Now we need to add up each of these, and that's our test statistic, and we get 125.6889. That's a pretty large value, by the way. But this could all be done using the chi-square test function.
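
The hand calculation just narrated can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R session from the video, and the 3x3 counts below are hypothetical: the transcript does not include the observed income-by-stature table, so this does not reproduce the 125.6889 result. The function names are also made up for the sketch.

```python
# Hypothetical 3x3 table of observed counts (NOT the data from the video).
observed = [
    [30, 20, 10],
    [20, 40, 20],
    [10, 20, 30],
]


def expected_counts(table):
    """Expected counts under independence: row total * column total / n.

    Building every row-total-by-column-total product is the same idea as
    the outer product of the margins described in the narration.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
statistic = chi_square_statistic(observed)
# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(statistic, df)
```

Keeping the expected counts in their own function mirrors the order of the narration: margins first, then the matrix of expected counts, then the cell-by-cell components that are summed into the statistic.
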
X is the matrix, which was here, and we just say chisq.test. Boom. And then it does Pearson's chi-square test for independence. Notice the test statistic values are the same. And we get this p-value, which is incredibly small. So we're going to reject the null hypothesis and conclude that stature and income are not independent. Now, this test result stores so much information, and this is just an illustration to show you that you can grab that information. So if we just wanted the observed data, or we wanted the expected data, we could do this. Okay, 14 minutes. Let's go ahead and stop here. The next video will be on tests of homogeneity in section five. So we'll see you then.
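
To see why the p-value is so small, here is a rough sketch in plain Python. It takes the test statistic reported in the video (125.6889) and the degrees of freedom for a 3x3 table, (3 - 1)(3 - 1) = 4, and evaluates the upper-tail chi-square probability. It relies on a closed form that holds only when the degrees of freedom are even, which is a shortcut chosen for this sketch and not how R computes it. The function name is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable with an EVEN number of df.

    For df = 2m the upper tail equals
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    Odd df are not handled by this sketch.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term = 1.0   # the k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


statistic = 125.6889          # test statistic reported in the video
df = (3 - 1) * (3 - 1)        # 3 income levels by 3 stature levels
p_value = chi2_upper_tail(statistic, df)
reject = p_value < 0.05       # alpha = 0.05, as in the example

print(p_value, reject)
```

With 4 degrees of freedom the 0.05 critical value is about 9.49, so a statistic above 125 lies far out in the rejection region, which is consistent with the conclusion in the video.
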
