[bsr46] Introduction to Biostatistics: Chapter 12  Analysis of Frequencies (part 4/5)
24:39


statisticsmatt 01.06.2026 70 views 4 likes


Video description
The videos on this YouTube Channel are not affiliated with The University of Missouri or my role as a professor at the University. Here's a link for pdf's of certain videos. https://statisticsmatt.gumroad.com Also note that if a pdf of the video you are wanting is not uploaded yet, please reply in a comment that you'd like me to upload and I'll do it. Help this channel to remain great! Donating to Patreon or Paypal can do this! https://www.patreon.com/statisticsmatt https://paypal.me/statisticsmatt Several playlists are in PDF format and can be purchased at https://gumroad.com/statisticsmatt

Contents (5 segments)

Segment 1 (00:00 - 05:00)

This is an introduction to biostatistics using R, and we're in chapter 12 of this series, the chi-square distribution and the analysis of frequencies. We're in section five, tests of homogeneity. The question is: is the distribution of a categorical variable the same across different populations? Here's a scenario. We want to compare drug adherence rates, yes or no, across three hospitals, so we would collect data like this: this is adherence, yes or no, and this is hospital. We want to know whether the proportion of yeses is the same across these three hospitals. That's a test of homogeneity.

So we set up one categorical variable crossed with two or more independent groups, and we ask: do the groups share the same distribution of the categorical variable? If we follow the adherence example and use this generic table, then X1 is adherence and the populations are the different hospitals. Here we would only have two categories, yes and no, and we have C populations, not just three. We want to know whether the probability of a yes is the same across all these hospitals. But it doesn't have to be two categories. There can be R categories for variable one and C categories for variable two, and we create an R by C contingency table, which starts out very similar to a test of independence.

Now, one difference in tests of homogeneity is that either the row or the column totals are fixed and the other is random.
For instance, if we stick with the example of adherence rates for patients across hospitals, we may say, I want 50 subjects from each site. So we're going to fix these column totals at 50. Or I could say, get as many as you can: I got 50 at site one, 60 at site two, and 110 at site three. Whatever those numbers are, they're fixed, they're determined by the design. But the number of subjects who are adherent or not is random, which means these row totals are random. That's what this bullet is pointing out: one margin is random and the other is often fixed. As a reminder, in the test of independence, which we did in the previous section, both row and column totals are random.

Now, in the test for homogeneity, we need to calculate the table probabilities. If we combine all the populations, that means we collapse the columns together: we add all the observations from each row and create one column, which is essentially this row-total column. And if we want to calculate the probability that we're in

Segment 2 (05:00 - 10:00)

category A, well, this is how many we observed in category A out of our total sample size. So this number divided by that one is an estimate of the probability of being in category A, and we can do the same for the other categories. That's what this formula says: the estimated probability of category A is the number observed in category A divided by the sample size. Now, under the null, the distribution of X1 is the same for each population. So if the null hypothesis is true, this overall estimate, which we call P A dot, applies to every population: given a column total, we would expect that same proportion of it to fall in category A. That's how we estimate each cell probability. So note that point.

Now we calculate the expected number of observations in category A for the first category of variable two, which means the first column. We saw this many observations in the first column, the column total, and we have an estimate of the probability of category A. So the probability times the column total is how many we'd expect in that cell, and that's what this formula says. That product is our expected cell count. And this is absolutely amazing, because it's the same as in the previous section, the test for independence: it's the column total times the row total divided by the sample size. We can do the same for category A in the second column: it's the second column total times our estimate of the probability of being in category A, which again is column total times row total divided by total sample size.
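The expected-count calculation just described can be sketched in a few lines. The video works in R; this is a plain Python illustration, and the cell counts are hypothetical (only the column totals 50, 60, and 110 echo the site sizes mentioned earlier).

```python
# Hypothetical adherence counts: rows are yes/no, columns are three hospitals.
# The column totals (50, 60, 110) are the fixed site sizes; the cell counts
# themselves are invented purely for illustration.
observed = [
    [40, 42, 83],  # adherent: yes
    [10, 18, 27],  # adherent: no
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pooled estimate of P(category i), obtained by collapsing over the hospitals.
p_hat = [r / n for r in row_totals]

# Expected count for cell (i, j) = column total j * pooled probability i,
# which equals row total i * column total j / n.
expected = [[c * p for c in col_totals] for p in p_hat]

for label, row in zip(["yes", "no"], expected):
    print(label, [round(e, 2) for e in row])
```

Each column of `expected` adds back up to its fixed column total, which is a quick sanity check on the calculation.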
So now we can calculate a test statistic using the chi-square test. It's the observed minus the expected, squared, divided by the expected. We do that for every single cell, then we add them up, and that is our test statistic. We can show that under the null it approximately follows a chi-square distribution with (r minus 1) times (c minus 1) degrees of freedom. Then we calculate the critical value, which defines the rejection region. If our test statistic is greater than that, we reject. If it's not, we don't.

So let's look at an example. A study is to be conducted to consider the association between sulfur dioxide and the mean number of chloroplasts per leaf cell of trees in the area. We sample three regions, and we pick the sample size for each: a region that has high sulfur dioxide levels, a region that has normal levels, and a region that has low levels. We randomly pick 20 trees from each area and calculate the mean number of chloroplasts per leaf cell, and we classify each as low, normal, or high. So let's

Segment 3 (10:00 - 15:00)

create a table, a 3x3 table. Notice these totals are fixed at 20. We fixed that. But these other totals are random; we don't know what they're going to be. Let's conduct a test for homogeneity on this data. We enter the data, and that replicates this table. We could add margins if needed, and really they are needed, because we need the column totals and row totals to get the expected cell counts. We use the function called outer: we take the row sums and the column sums and form all possible products of those, and when we're done we divide by the total sample size. That is our expected cell count for each cell. Notice that each of those expected counts is greater than or equal to five, so the validity condition for the chi-square test is met. Now we calculate the observed minus expected, squared, divided by the expected, and we add all those up. That's our test statistic.

Now let's just use the built-in function. We pass in our matrix X and it runs the chi-square test. Notice the test statistics are the same. Here's the p-value, which says to reject: the distributions of chloroplast level across sulfur dioxide levels are not the same. They're not homogeneous. And this here is just showing you that the test object holds a lot of information. If needed, you can grab the observed and expected cell counts, and there's plenty more. So we reject the null, and the expected cell frequencies are greater than or equal to five, which was one of the assumptions.

In this section, we're going to look at Fisher's exact test. It's typically used for 2x2 contingency tables that summarize the frequencies of two categorical variables. The null hypothesis we're going to test is no association between the two categorical variables, meaning they're independent.
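Before the hypergeometric details, the homogeneity workflow from the chloroplast example (margins, outer product, expected counts, statistic, p-value) can be sketched as follows. This is a Python illustration rather than the R session from the video, and the 3x3 counts are hypothetical: the transcript does not give the actual data, only that each region contributes 20 trees. The p-value helper uses a closed form that holds only for even degrees of freedom, which is a shortcut chosen here, not how R's `chisq.test` computes it.

```python
import math

# Hypothetical counts. Rows: chloroplast level (low, normal, high).
# Columns: sulfur dioxide region (high, normal, low), 20 trees each.
observed = [
    [12, 6, 3],
    [5, 8, 6],
    [3, 6, 11],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Outer product of the margins divided by n gives the expected counts.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Validity check used in the video: every expected count is at least 5.
all_at_least_five = all(e >= 5 for row in expected for e in row)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


p_value = chi2_upper_tail_even_df(statistic, df)
print(all_at_least_five, round(statistic, 4), df, round(p_value, 4))
```

With these made-up counts the statistic exceeds the 5% critical value for 4 degrees of freedom, so the sketch also ends in a rejection, but that outcome says nothing about the real study data.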
Now, the core of Fisher's exact test lies in calculating probabilities using the hypergeometric distribution. We're not going to go into great detail here, but we create a 2x2 frequency table with the cell counts here: cell one-one, cell one-two, and so on. The test conditions on the row and column totals, and then we can show that this cell count follows a hypergeometric distribution. This is its probability mass function, and these brackets are the combinations formula, n choose k. The values that k can take fall in a range that depends on the sample size and on how many are in each category, but we're not going to go into detail on that. Fisher's exact test calculates the probability of obtaining the observed table or any table more extreme than the one we observed. That is the p-value. The key assumption is that the row and column totals are fixed, and some statisticians don't like Fisher's test for that reason, when they think the totals are truly random. But this test works. It's so powerful, it's so robust, it's worth

Segment 4 (15:00 - 20:00)

learning, in my opinion. So let's create a generic 2x2 contingency table for Fisher's exact test. This is variable one and this is variable two. These are the subjects with the characteristic of interest, and these are without. And these are the different samples: population one and population two, or sample one and sample two. The null hypothesis is that P1 is equal to P2, where P i is the proportion with the characteristic in sample i (maybe that index should have been a j). So P1 is the proportion of people with the characteristic in sample one, and P2 is the proportion with that characteristic in sample two. We want to know whether they're equal or different, and that's generally what you test. But you can create one-sided tests if you want to, and it all depends on that number right there, the count a.

If the alternative is that P1 is less than P2, we sum the table probabilities for values of a or smaller. If the alternative is that P1 is greater than P2, we sum the table probabilities for values of a or greater. Think about the first one. We observed a observations in that cell, and the p-value is the probability of a result at least as extreme as that. If we think P1 is less than P2, the only way to get more extreme is for that number to go down. So we decrease it one at a time all the way down to zero, or to the smallest value the margins allow, calculate the table probability for each, and add them up. That's the p-value, and that's how we get less than or equal to a. In the same way, if we think P1 is greater than P2, we sum the table probabilities for a or greater, which are the even more extreme tables. And if the alternative is that they're not equal, we sum the probabilities of all tables whose probability is less than or equal to that of the observed table, and take that as the p-value.
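The three p-value rules above can be sketched directly from the hypergeometric probabilities. This is a Python illustration, not the R code from the video. The function name `fisher_exact_pvalues` and the example table are hypothetical, and the small relative tolerance in the two-sided rule is an assumption modeled on how R's `fisher.test` guards against floating-point ties.

```python
from math import comb


def fisher_exact_pvalues(a, b, c, d):
    """Return (less, greater, two_sided) p-values for the 2x2 table [[a, b], [c, d]].

    Conditions on both margins, so the top-left count follows a
    hypergeometric distribution.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    # Range of top-left counts that the fixed margins allow.
    lo, hi = max(0, col1 - row2), min(row1, col1)
    denom = comb(n, col1)
    prob = {k: comb(row1, k) * comb(row2, col1 - k) / denom for k in range(lo, hi + 1)}

    less = sum(p for k, p in prob.items() if k <= a)      # a or smaller
    greater = sum(p for k, p in prob.items() if k >= a)   # a or greater
    # Two-sided: every table no more probable than the observed one.
    cutoff = prob[a] * (1 + 1e-7)
    two_sided = sum(p for p in prob.values() if p <= cutoff)
    return less, greater, two_sided


# Hypothetical table: 1 of 5 with the characteristic in sample one, 4 of 5 in sample two.
less, greater, two_sided = fisher_exact_pvalues(1, 4, 4, 1)
print(round(less, 4), round(greater, 4), round(two_sided, 4))
```

For this symmetric table the two-sided p-value is exactly twice the smaller one-sided value; that doubling does not hold in general when the margins are unbalanced.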
One note, though: R frames the test, equivalently, in terms of the odds ratio instead of the two proportions. We're going to develop this in a later section. But here's an example. This is the famous Fisher tea-tasting experiment, also known as the lady tasting tea. The experiment was conducted by Ronald Fisher to see whether a lady could distinguish whether milk or tea was poured first into a cup. This lady claimed that she could tell. The experiment involved presenting eight cups, four of each type: four had tea poured first and the other four had milk poured first, and they were presented in random order. Fisher's null was that the lady could not distinguish between the cups, meaning her success was due to chance. This is the data from the experiment, and to me it's kind of crazy: she got six of the eight correct. We take that data and put it into Fisher's test, which is two-sided since we pass no additional parameters, and it pops out this information

Segment 5 (20:00 - 24:00)

and the p-value is 0.48. So there is not enough evidence that she could distinguish between milk poured first and tea poured first, which is a little surprising, because it seems like she can. But since the sample size is so small, I guess it's not surprising that there's not enough evidence. It also provides an odds ratio, which we're going to cover one or two sections down the road, so I hope you come back and revisit this. With Fisher's test we could also do a greater-than test. As a reminder, that is the alternative, so we sum the probabilities of the observed table and the tables that are more extreme in that direction, and here is the p-value. Notice that the one-sided p-value is half of the two-sided one, and there's still not enough evidence. And if we go the other way, which is kind of silly in my mind, the p-value is 0.98. There's clearly not enough evidence to reject the null in this example.

Let's see how much time we have. We'll cover this and then call it quits. We're going to consider whether the twin brother of a known convict was himself convicted or not convicted, so we form a 2x2 contingency table. We have 12 convicted and 18 not convicted, and the twins are classified as monozygotic or dizygotic. Remember the scenario: one of the twins has been convicted, and now we're looking at the other twin to see whether he has ever been convicted. We classify by the type of twin and by whether he's been convicted or not. Let's run Fisher's exact test as a one-sided test. The p-value is small, so these two probabilities, the probability of conviction for one type of twin and for the other, are not the same.
The proportion convicted among the dizygotic twins is less than the proportion convicted among the monozygotic twins. That's what this Fisher's test just said, because the alternative is less than: P1 is less than P2. Let's do a two-sided test here. This time we don't produce the confidence interval, just to show you that that's an option. The two-sided p-value is also highly significant, meaning those two probabilities are not the same. We can add a 95% confidence interval; I'm just showing you that that's a possibility. That's all. It's again a two-sided test because there's no alternative listed, and the result is the same as we got before. Okay, that is it for this one, and in the next video we're going to do section seven, which covers relative risk, the odds ratio, and the Mantel-Haenszel test.
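As a recap, the p-values quoted for the two Fisher examples can be re-derived by hand from the hypergeometric distribution. This is a Python sketch, not the R output shown in the video, and `fisher_p` is a hypothetical helper. The tea table follows from "six of eight correct" with four cups of each type. The twin cell counts are an assumption: they are taken from the `Convictions` example in R's `fisher.test` documentation, whose margins (12 convicted, 18 not) match the ones quoted here.

```python
from math import comb


def fisher_p(table, alternative="two.sided"):
    """Fisher exact p-value for a 2x2 table, conditioning on both margins."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = comb(row1 + row2, col1)
    prob = {k: comb(row1, k) * comb(row2, col1 - k) / total for k in range(lo, hi + 1)}
    if alternative == "less":
        return sum(p for k, p in prob.items() if k <= a)
    if alternative == "greater":
        return sum(p for k, p in prob.items() if k >= a)
    # Two-sided: tables no more probable than the observed one
    # (the relative tolerance is an assumption modeled on R's behavior).
    cutoff = prob[a] * (1 + 1e-7)
    return sum(p for p in prob.values() if p <= cutoff)


# Tea tasting: rows are the true pouring order, columns are the lady's guess.
tea = [[3, 1], [1, 3]]
# Twins (assumed counts): rows dizygotic, monozygotic; columns convicted, not convicted.
twins = [[2, 15], [10, 3]]

tea_p = {alt: fisher_p(tea, alt) for alt in ("two.sided", "greater", "less")}
twins_p = {alt: fisher_p(twins, alt) for alt in ("less", "two.sided")}

for name, result in (("tea", tea_p), ("twins", twins_p)):
    for alt, p in result.items():
        print(name, alt, f"{p:.6f}")
```

For the tea table the three values are 34/70, 17/70, and 69/70, which round to the 0.48, half of that, and 0.98 mentioned in the transcript. For the assumed twin table both the one-sided and the two-sided p-values are well below 0.001.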
