# Logistic Regression Example

## Metadata

- **Channel:** MarinStatsLectures-R Programming & Statistics
- **YouTube:** https://www.youtube.com/watch?v=NRPY-RD5N80
- **Source:** https://ekstraktznaniy.ru/video/44716

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hi everyone, my name is Ben T and I'm an assistant professor in the Department of Medicine at UBC. Today I'm going to give you a quick walkthrough of a short R analysis, applying some of the concepts that you learned in your previous lectures on logistic regression. As you probably know, it's often difficult to find public health data to illustrate some key takeaway points, so the data that I'm going to be using today is fake data that I generated for the purposes of this analysis.

To form the basis of this fake data, I thought it'd be fun to do something that relates to one of Mike's and my big passions, which is that we are big dog people. These are our dogs: Otis the boxer and Pluto the poodle. Otis is eight years old, and lately he's been having some GI issues. I ended up sucked into a deep, dark hole on the World Wide Web a few months ago trying to find reasons why he would be having these issues. There were a lot of websites that debated the use of raised dog food bowls to prevent GI issues, some of which stated some benefits and some of which stated that it made matters worse. I was very determined and ran a literature search through PubMed a while ago to see if there were any peer-reviewed studies that looked at this, and surprisingly found only one study, by Glickman and colleagues. This is the study here, if any of you want to take a look further. They collected an extensive number of variables to look at factors associated with what they call gastric dilatation-volvulus, or GDV for short.

I created this data based on the Glickman article but, as I mentioned, please keep in mind that the numbers and statistics calculated in this analysis don't reflect any real associations; the data was generated just for the purpose of walking you through how to conduct logistic regression in R.

Okay, so the data includes 129 large purebred show dogs in British Columbia that were randomly selected, and the data was collected at one point in time from their owners. Let's load the data in. We know there are a number of ways to import the data. One way, if you've already created an RStudio project and put your data file within the project, is to go to this Files tab here, go down to dog.xlsx, and choose Import Dataset. Here you can change how R sees your data; you can choose to do this later, but you can also do it here. These are all of the variables in the data set. You'll notice that age, for example, is a numeric variable, so here I've changed it to have R recognize age as a numeric variable. Similarly, I can do the same with this meals variable, which I know is also a numeric variable. You can also change the name of the data set here, but I'm going to keep it as dog. Once this all looks good, go ahead and import your data set.

You'll see in the Environment that the dog data set has been imported; there are a total of 129 observations and 12 variables. You can also see here, in the R script that I generated, how the variables are coded. For example, height food bowl, which is the main exposure variable that we're going to be looking at, is the height of the dog food bowl, coded as 0 if it is less than or equal to one foot and as 1 if it is greater than one foot.

So the key research question that we want to answer today is: is there an association between height of dog food bowl and gastric dilatation-volvulus in large-breed dogs?
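
As a rough sketch of the data set described above: the lecture's file isn't available here, so the block below builds a stand-in `dog` data frame. The column names `height_bowl` and `gdv` are assumptions (the real variable names aren't visible in the transcript), and the four cell counts are back-calculated from summary figures quoted later in the lecture (49 dogs with GDV, 53 with a raised bowl, odds ratio 10.26) rather than read from the original file. The commented-out `read_excel()` line shows how a scripted import might look, with a guessed file name.

```r
# Importing from a script instead of the RStudio dialog might look like this
# (file name is a guess; requires the readxl package):
# dog <- readxl::read_excel("dog.xlsx")

# Stand-in data: one row per cell of the exposure-by-outcome table.
# Counts are back-calculated from figures quoted in the lecture, not real data.
cells <- data.frame(
  height_bowl = c(0, 0, 1, 1),      # 0 = bowl <= 1 ft, 1 = bowl > 1 ft
  gdv         = c(0, 1, 0, 1),      # 0 = no GDV, 1 = GDV
  n           = c(63, 13, 17, 36)   # number of dogs in each cell
)

# Expand to one row per dog (129 rows in total)
dog <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("height_bowl", "gdv")]
rownames(dog) <- NULL

# The 0/1 coding rule for the exposure, applied to hypothetical raw heights in feet
raw_height_ft <- c(0.5, 1.0, 1.5)
coded <- as.integer(raw_height_ft > 1)   # 0 if <= 1 ft, 1 if > 1 ft

str(dog)      # structure: two numeric columns
summary(dog)  # quick overview of each variable
```

The real data set has 12 variables; only the exposure and outcome are reconstructed here, since those are the only ones whose counts can be recovered from the lecture.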

### Segment 2 (05:00 - 10:00) [5:00]

So before we do any analysis, we should look at the data a little more closely. You can start just by looking at a summary of the data set. You can also create tables and look at frequencies. Here we're looking at the frequency of the main outcome, which is GDV: you can see that there are 49 dogs who had GDV and 80 who did not. Then you can also look at the main exposure, which is the height of the dog food bowl: there are 53 dogs who have a raised dog food bowl and 76 who do not. Similarly, if you want to look at proportions instead of frequencies, you can do that using this command: 41% of dogs have a raised food bowl.

So let's bring you back to what you learned last semester and create some graphical displays of the data. We know that height bowl and GDV are both two-level categorical variables, so we can decide to create a side-by-side bar plot. Here I first created a table of the two variables and stored it in an object called t. If you want to recall t, you can do that here. Now I can create a bar plot of t. You can see it's not a very nice bar plot: maybe you want some labels, maybe you want it side by side as opposed to stacked. So here's a way to make it a little nicer. You can use the barplot command on the table t. My title would be "Side-by-side bar plot of height of dog food bowl and GDV". I can make x and y labels using the xlab and ylab arguments; ylim sets the limits where you want the frequency axis to start and end; the names.arg argument is for changing the labels from 0 and 1 to yes and no; and I can also decide to include a legend that shows which bars denote less than one foot and which denote greater than one foot. So here's a nicer version of the side-by-side bar plot. You can also create a mosaic plot, as we learned, from two categorical variables.

You also learned in SPPH 400 about using chi-square tests for assessing the relationship between two categorical variables. The null hypothesis is that the two variables are independent, and the alternative is that they are dependent. Assuming that all of the assumptions are met to conduct this test, let's run it with the table t. If you didn't store the table as an object, you can spell out the whole table here. You'll see that the test gave a very small p-value, less than 0.001, meaning that we can reject the null hypothesis and conclude that we have evidence of a dependent relationship between height of dog food bowl and GDV. But what's the direction of this relationship? Because we can't tell using chi-squared, we'll have to look into the direction and strength using other measures. As we learned last semester, you can download the epiR package if you haven't already and use the epi.2by2 command to look at other relevant statistics like the prevalence ratio, odds ratio, and confidence intervals. We'll indicate here that the method is cross-sectional, so it won't spit out risk ratios.
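
The descriptive steps in this segment can be sketched in base R roughly as follows. This is a sketch under assumptions, not the lecturer's script: the data are a stand-in rebuilt from the counts quoted in the lecture, the column names `height_bowl` and `gdv` are hypothetical, and the table is stored as `tab` rather than `t`. The odds ratio and prevalence ratio are computed by hand here instead of with `epi.2by2`, so the block has no package dependency.

```r
# Stand-in data (counts back-calculated from the lecture's figures; names assumed)
cells <- data.frame(height_bowl = c(0, 0, 1, 1), gdv = c(0, 1, 0, 1),
                    n = c(63, 13, 17, 36))
dog <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("height_bowl", "gdv")]

# Frequencies and proportions
gdv_freq  <- table(dog$gdv)                      # counts of dogs without / with GDV
bowl_prop <- prop.table(table(dog$height_bowl))  # share of dogs at each bowl height

# Two-way table: rows = bowl height, columns = GDV (the lecture calls this t)
tab <- table(height_bowl = dog$height_bowl, gdv = dog$gdv)

# Side-by-side bar plot with a title, axis labels, and a legend
barplot(tab, beside = TRUE,
        main = "Height of dog food bowl and GDV",
        xlab = "GDV", ylab = "Frequency", ylim = c(0, 80),
        names.arg = c("No", "Yes"),
        legend.text = c("<= 1 ft", "> 1 ft"))
mosaicplot(tab, main = "Bowl height by GDV")

# Chi-square test of independence
# (for 2x2 tables chisq.test applies a continuity correction by default)
chi <- chisq.test(tab)
chi$p.value

# Direction and strength, computed by hand from the cell counts
n11 <- tab["1", "1"]; n10 <- tab["1", "0"]   # raised bowl: GDV / no GDV
n01 <- tab["0", "1"]; n00 <- tab["0", "0"]   # low bowl:    GDV / no GDV
odds_ratio <- (n11 * n00) / (n10 * n01)
prev_ratio <- (n11 / (n11 + n10)) / (n01 / (n01 + n00))
round(c(odds_ratio = odds_ratio, prev_ratio = prev_ratio), 2)
```

With these reconstructed counts the odds ratio comes out at 10.26, matching the lecture. The directly computed prevalence ratio of GDV is about 3.97, whereas the 2.58 quoted in the next segment equals (63/76)/(17/53), the ratio for the no-GDV column. That would be consistent with the speaker's remark that `epi.2by2` treated the 80 dogs without GDV as having the outcome: the odds ratio is unaffected by that flip, but the prevalence ratio is not.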

### Segment 3 (10:00 - 15:00) [10:00]

So please make sure that you're interpreting this correctly. Here the epi.2by2 output shows that 80 dogs have the outcome, whereas we know from our previous table that it's actually 80 dogs that do not have the outcome. Keep that in mind, even though the way the epi.2by2 command presents it ends up aligning with what you're actually trying to model.

Okay, so now let's interpret the prevalence ratio, which is that the prevalence of GDV among dogs with a raised food bowl is 2.58 times the prevalence among dogs without a raised food bowl. We can also interpret the odds ratio: the odds of GDV among dogs with a raised food bowl are 10.26 times the odds among dogs without a raised food bowl. That's really high. You can also say that you're 95% confident that the true odds ratio is between 4.47 and 23.54. It's a pretty wide confidence interval, so really, how confident are we?

So don't fall asleep yet; we're just now getting to the good stuff. Let's learn how to construct a logistic regression model. You can do this using the glm command. Here I'm asking R to store an object called mod1, and I'm running a model with GDV as the main outcome and height bowl as the main exposure; family = binomial tells R that I want to run a logistic regression. I can also pull out a summary of mod1 using the summary command. Remember that logistic regression models the log odds, so if you want the odds, you're going to have to exponentiate the estimates. To do this, you can ask R to pull out the coefficients from the mod1 object and exponentiate them. If you're interested, you can see what other statistics the model stores using the names command: if I run names(mod1), I can ask R to pull out the coefficients, which I just did, but I can also ask R to pull out the residuals or the AIC, for example. You can also exponentiate the confidence intervals, and here you'll notice that the odds ratio of 10.26 is the same one that you calculated using the epi.2by2 command, and the confidence interval of 4.59 to 24.34 is close to the 4.47 to 23.54 from before.

Okay, so we see a relationship between raised dog food bowl and GDV, but are there other factors that could confound this relationship? Let's look at whether age and breed are confounding factors. Remember, confounders are secondary variables that we believe are associated with the main x and y variables but are not on the causal pathway. We hypothesize that age may be a confounder because older dogs may be more prone to GDV and perhaps more likely to have a raised food bowl because it's more comfortable to eat from. Similarly, perhaps some breeds are more likely to have GDV because of genetic factors, and maybe some breeds are more likely than others to have a raised food bowl for some reason. Anyway, let's construct a new model called mod2, where we're looking at GDV as the outcome and using height bowl, breed, and age as potential x variables, asking R to fit a logistic regression model. To get a summary, again I can use the summary command. Let's also go ahead and exponentiate the coefficients and the confidence intervals. If you go back up here, you'll see that the p-value for height bowl is still less than 0.05. We can also see that some breeds might have a p-value of less than 0.05 as well.
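
A minimal sketch of the unadjusted model from this segment, again on a stand-in data frame rebuilt from the counts quoted in the lecture (the column names `height_bowl` and `gdv` are assumptions). It fits `mod1`, exponentiates the coefficients, and computes two kinds of 95% interval so the small gap between the two intervals quoted in the lecture can be examined.

```r
# Stand-in data (counts back-calculated from the lecture's figures; names assumed)
cells <- data.frame(height_bowl = c(0, 0, 1, 1), gdv = c(0, 1, 0, 1),
                    n = c(63, 13, 17, 36))
dog <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("height_bowl", "gdv")]

# Unadjusted logistic regression: GDV on bowl height
mod1 <- glm(gdv ~ height_bowl, data = dog, family = binomial)
summary(mod1)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
or1 <- exp(coef(mod1))

# Other components stored in the fitted object
names(mod1)
mod1$aic

# Two kinds of 95% interval for the odds ratio
wald_ci    <- exp(confint.default(mod1))  # Wald (normal approximation)
profile_ci <- exp(confint(mod1))          # profile likelihood
round(rbind(wald    = wald_ci["height_bowl", ],
            profile = profile_ci["height_bowl", ]), 2)

# Adjusted model as described in the lecture; not run here because the
# stand-in data has no breed or age columns:
# mod2 <- glm(gdv ~ height_bowl + breed + age, data = dog, family = binomial)
```

On this stand-in data the Wald interval reproduces the 4.47 to 23.54 reported from `epi.2by2`. `confint()` on a `glm` uses profile likelihood instead, which is likely where the slightly different 4.59 to 24.34 comes from.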

### Segment 4 (15:00 - 20:00) [15:00]

And we can also see that age might be positively associated with GDV. You'll notice, however, that after we controlled for age and breed there was a weaker relationship between raised dog food bowls and GDV. Here we see an odds ratio of 7.69, so we can say the odds of GDV among dogs with a raised food bowl are 7.69 times the odds among dogs without a raised food bowl, after adjusting for confounders.

Now let's take a closer look at the breed confounder. What breed are we comparing the other breeds to? We're comparing them to Akitas; remember, R always sorts alphabetically. So what we can say here, if we take Dobermans for example, is that compared to Akitas, Dobermans have 0.69 times the odds of getting GDV. Maybe Akitas aren't the greatest reference group, though. What if we want to change the reference group to Labrador? Maybe Labradors are more common. If we want to change the reference group, what we want to do first is have R identify the breed variable as a factor using the as.factor command. Once we do that, we can use the relevel command, so I'm releveling the reference group to be Labrador. Then let's rerun the model and store it in an object called mod2a. Let's also pull out a summary of this, and we can pull out the coefficients and the confidence intervals again. So, ignoring other factors, what breeds have higher odds of GDV compared to Labradors? If you go back to the original summary here, you can see that poor Otis the boxer has significantly higher odds of GDV compared to Labradors, and this is even after controlling for various factors such as age and the height of the food bowl.

So let's say you're talking to a vet and they tell you that having a family history of GDV may explain the association between raised dog food bowl and GDV. You may want to run another model that includes this variable. Here I'm running a model called mod3, where I'm adding gdv hist, which, if you remember, is a two-level categorical variable, yes or no, for family history. I can pull out the summary here, and what you can see is that the relationship between having a raised dog food bowl and GDV goes away once we include family history of GDV. So maybe what this model is trying to say is that history of GDV may really be the driving factor in this relationship, and it matters less whether or not your dog has a raised food bowl and more whether your dog has a family history of GDV.

So, using what you learned in your previous classes, let's take a look at which model is a better fit: one that includes GDV history or one that doesn't. To do that, you can use the likelihood ratio test with the anova command. Here I'm using anova comparing mod2 to mod3, with test equals LRT. You can see that the p-value is less than 0.05. Essentially, what that means is that a model that includes GDV history as a confounder is indeed a better model than one without. There may be other confounding variables that you may want to explore. Here, what I

### Segment 5 (20:00 - 20:00) [20:00]

suggest you do, before you even touch the data or future data, is draw a directed acyclic graph. Have a think about the various associations, potential confounders and covariates, and how the variables interact with each other before you dive into the analysis.

So, to conclude... well, we can't really, because we know this is fake data. But if it were real data, maybe I'd decide to do one of those Embark dog genetic tests to see what the medical and genetic history of Otis is before I decide to change the way that he eats. Thanks.
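
The releveling and model-comparison steps from Segment 4 can be sketched as below. Everything about the data here is an assumption made for illustration: the covariates are simulated with arbitrary effect sizes, the breed list and the column names (`height_bowl`, `breed`, `age`, `gdv_hist`, `gdv`) are hypothetical, and the estimates will not match the numbers quoted in the lecture. The point is only the mechanics of `relevel()` and `anova(..., test = "LRT")`.

```r
set.seed(1)
n <- 129
breeds <- c("Akita", "Boxer", "Doberman", "Labrador", "Poodle")  # hypothetical list

# Simulated stand-in data with arbitrary effect sizes (not the lecture's data)
dog <- data.frame(
  height_bowl = rbinom(n, 1, 0.4),
  age         = round(runif(n, 1, 12)),
  breed       = sample(breeds, n, replace = TRUE),
  gdv_hist    = rbinom(n, 1, 0.3)
)
log_odds <- -2 + 1.0 * dog$height_bowl + 0.1 * dog$age + 1.5 * dog$gdv_hist
dog$gdv  <- rbinom(n, 1, plogis(log_odds))

# Adjusted model; with breed as character/factor the first level alphabetically
# ("Akita") becomes the reference group
mod2 <- glm(gdv ~ height_bowl + breed + age, data = dog, family = binomial)

# Make Labrador the reference group, then refit
dog$breed <- relevel(as.factor(dog$breed), ref = "Labrador")
mod2a <- glm(gdv ~ height_bowl + breed + age, data = dog, family = binomial)
exp(coef(mod2a))   # breed odds ratios are now relative to Labradors

# Add family history of GDV and compare the nested models
mod3 <- update(mod2a, . ~ . + gdv_hist)
lrt  <- anova(mod2a, mod3, test = "LRT")
lrt
```

Releveling only changes which breed the others are compared against; the fitted model is the same, so `mod2` and `mod2a` have the same deviance, and comparing either of them to `mod3` gives the same likelihood ratio test.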
