Plotting categorical vs quantitative data with ggplot2

Segment 1 (00:00 - 05:00)

Hey everybody. Today we're doing some more data visualization with R, working on plots where we have one quantitative variable and one categorical variable. As usual, I'm going to be working in a Quarto document here, and when I'm all done, I'll upload it to my website, www.equitablequations.com. Check it out if you're interested. So the first thing I'm going to do is load up tidyverse. I'm also going to use theme_set() to set a minimal theme throughout. I'm not a huge fan of ggplot2's default gray background, and I'm also getting a little tired of putting + theme_minimal() after every one of my ggplots, so I'll go ahead and put that in my setup chunk. By the way, the chunk option I have here, message: false, sets this up so that when I click Render, the long printout that comes with library(tidyverse), which you're seeing over here in the console, doesn't show up in my rendered Quarto document. Now, the go-to plot when you have one categorical and one quantitative variable is the box plot. These aren't necessarily glamorous, but they have the merit that pretty much everybody understands how to read a box plot, at least broadly speaking. Even if they aren't thinking specifically in terms of Q1, median, and Q3, they at least get an idea of what a box plot is saying about the range of values in the data set. I'm going to be looking at the penguins data set throughout this video. That's in base R nowadays; if you have anything resembling an up-to-date version of R, you already have it. Let's take a quick glimpse at it before we do anything else, either to learn about it or to refresh our memory. We have 344 observations, the penguins, and eight different variables, which are the columns here. The variables are things like species, island, bill length, and so on.
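A minimal sketch of the setup just described, written as the body of a Quarto R chunk. It assumes R 4.5 or later (the release in which penguins joined base R) and an installed tidyverse; the chunk label "setup" is taken from what is said later in the video, and the exact layout of the original chunk is not shown in the transcript.

```r
#| label: setup
#| message: false

# The "#|" lines above are Quarto chunk options; in a plain R script
# they are ordinary comments and have no effect.
library(tidyverse)

# Use the minimal theme for every plot made after this point,
# instead of adding + theme_minimal() to each one.
theme_set(theme_minimal())

# Quick look at the data: 344 rows (penguins) and 8 columns (variables).
glimpse(penguins)
```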
In this video, I'm interested in plotting the body mass of the penguins versus their species, trying to get a feel for which of the penguin species tend to be the largest. To start out, let's get a very basic box plot. I'm going to use ggplot() as usual as my workhorse function. I say the name of the data set first. Then I say which variables are to be displayed and how, and I wrap that inside aes(), for aesthetic mappings. In this case, I want species on the x-axis and body mass on the y-axis. When I'm done with that, I use a plus sign and then say what sort of geometry, what sort of actual plot, I want. In this case, a box plot. So, there we go. This is a very workmanlike and reasonably clear plot. We have three different species, and you can already see that gentoos tend to be larger than the other two. Also notice that the side-by-side box plot we're getting here displays outliers. You can see that in the chinstrap category, there's one exceptionally light penguin and one exceptionally heavy one. Now, there are any number of ways we can touch up this basic box plot to make it communicate the data more clearly and also just to make it look more attractive, so I want to make some of those changes right now. First of all, I'll hit Command-Enter in this code chunk so that we can see where I'm trying to get to, and zoom in on it quickly. The most important thing I've done here is to add a geom_jitter() layer to my box plot to display the individual observations in the set. The code for that is right here: after geom_boxplot(), I added geom_jitter(), literally adding a layer to my plot. The alpha = 0.3 here says that I want these points to be only 30% opaque, that is, 70% transparent. That's why you can kind of see through them here.
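The basic box plot described at the start of this passage can be sketched as follows. This assumes R 4.5 or later, where the base R penguins data set names the mass column body_mass; the object name p_basic is only for illustration.

```r
library(tidyverse)

# Data set first, then the aesthetic mappings inside aes():
# species on the x-axis, body mass on the y-axis.
p_basic <- ggplot(penguins, aes(x = species, y = body_mass)) +
  geom_boxplot()  # the geometry layer: one box per species

# Rows with a missing body mass are dropped with a warning when drawn.
p_basic
```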
The motivation for doing this is that box plots, while they are very clear in showing the comparison of the groups and the overall spread of the data, suppress the sample size within each of those groups. If we're just looking at the basic box plot, we can't tell whether one of those boxes represents a million observations or just five. So when you have a data set where it's practical, it's nice to add a layer like this to show the actual sample sizes in one way or another. Now, some of the other changes I made here were a bit more cosmetic, but I'll point those out anyway. First of all, relating to the geom_jitter(), you'll see that I added an argument to my geom_boxplot(): outliers = FALSE. In my original box plot, I had these dots for the outliers for the chinstraps. I didn't want those in the version of the plot with the jitter because I already have points for those two outliers. You can see right here, this is the very heavy chinstrap penguin. If I leave the outlier in, it will look like there's actually two very

Segment 2 (05:00 - 10:00)

heavy penguins, which is misleading in a small but potentially very important way. The other things I've done here are purely cosmetic. In particular, I've added some color. I already have species mapped to the x-axis, but I decided I also wanted these boxes filled by species, so they're a bit more distinct. Since I have species already labeled on the x-axis, however, I don't need an overall legend for this plot. So, at the very end of my code, I added a theme() layer with legend.position = "none". There are multiple ways to remove that legend; this is one very simple one. I think it's a good idea here because the legend would be redundant. Since I'm putting color into my plot, I'm going to change R's default color palette. scale_fill_brewer(palette = "Dark2") lets R know that it should change the way the fill aesthetic is displayed. It's still going to map fill from the species variable, but now it's going to use colors from the Dark2 palette, which is generally considered more colorblind-friendly than R's default palette. Whenever you're using color, it's important to remember that human perception of color varies widely. Also, people will be seeing your plots on a wide variety of monitors or potentially even printouts, and you don't want to rely on color too heavily as a distinguishing variable. The final thing I've done with this plot is to add a labs() layer to this ggplot call to fix up my labels a little bit. You can see the y-axis label now reads body mass in grams rather than body_mass, and species is now capitalized on the x-axis. While I'm at it, I've also added a little caption here to cite my sources. This data comes from Gorman, Williams, and Fraser. You can get the more specific references if you look at the help file with ?penguins.
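Putting the pieces of the touched-up box plot together, a sketch might look like this. It assumes R 4.5 or later and ggplot2 3.5.0 or later (for the outliers argument); the exact label and caption wording is an assumption, since the transcript only describes them, and p_jitter is an illustrative name.

```r
library(tidyverse)

p_jitter <- ggplot(penguins, aes(x = species, y = body_mass, fill = species)) +
  geom_boxplot(outliers = FALSE) +      # hide outlier dots; the jitter layer already shows them
  geom_jitter(alpha = 0.3) +            # individual penguins, 30% opaque
  scale_fill_brewer(palette = "Dark2") +
  labs(
    x = "Species",
    y = "Body mass (g)",                                # label wording assumed
    caption = "Source: Gorman, Williams, and Fraser"   # caption wording assumed
  ) +
  theme(legend.position = "none")       # legend would repeat the x-axis labels

p_jitter
```

Note that geom_jitter() moves points randomly, so the plot looks slightly different on every run unless a seed is set first.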
Now, there are a number of variations on the box plot that could be useful to you if you want to show your data a little bit differently or just get a little variety. The one that's most similar, and I think most worth considering, is the violin plot. So the plot I'm going to show now really is just a simple variation on the box plot we just saw. Let me hit Command-Enter on it. Here I've removed the geom_jitter(), but you can certainly put that back in so that you can see the individual observations. Once again, we have species on the x-axis and body mass on the y-axis, and again I've colored this by species. This time I've used geom_violin() instead of geom_boxplot(). And you can see what the violin plot is doing: it's showing the density of the points as we go up and down in the three different groups. I've kept my label change and my scale_fill_brewer() layer. I've once again added theme(legend.position = "none") because the legend for the fill aesthetic, the three species, would be redundant with the x-axis. As I mentioned a moment ago, we can also do this with a bar chart. I want to show that, though I won't say too much about it. Let me execute this code, we'll look at it first, and then I'll make a few more comments. You can see here we have very similar aesthetics. I have species on the x-axis, I still put my source in there, and the color. So you can already imagine some of the code that's going into this. The y-axis, though, is not just the body mass of the different penguins; it's labeled average body mass. So the height of each of these bars represents one summary statistic for the whole group. On the one hand, this may feel fairly intuitive. You can quickly see from this plot that the gentoos are larger.
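The violin variation described above can be sketched by swapping the geometry layer. As before, this assumes R 4.5 or later, and the label and caption wording is an assumption; p_violin is an illustrative name.

```r
library(tidyverse)

p_violin <- ggplot(penguins, aes(x = species, y = body_mass, fill = species)) +
  geom_violin() +                       # width shows the density of points at each mass
  scale_fill_brewer(palette = "Dark2") +
  labs(
    x = "Species",
    y = "Body mass (g)",                                # label wording assumed
    caption = "Source: Gorman, Williams, and Fraser"   # caption wording assumed
  ) +
  theme(legend.position = "none")

p_violin
```

Adding + geom_jitter(alpha = 0.3) to this plot would bring the individual observations back.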
But on the other hand, the problem we had with the box plot suppressing the information from specific observations is even more present here. In this plot, we just have the average body mass. We don't even have the min, the max, the first quartile, and the third quartile. Another issue with this plot is that it kind of makes it look like we have more gentoos than the other two kinds of penguins, and that's not what's being represented here. These aren't counts of penguins. So, if you're going to use a bar chart in this way, you want to be cognizant of the fact that your readers might try to read counts into it. That just tends to be how we think when we see bar charts. Now, again, I don't want to go into the code for this in depth, but I do want to point out a little bit about it. First of all, I have some code here that's actually getting those averages. In this course, we haven't talked about these functions yet, so I don't want to say anything about them except that the output is this little table where we have the three different species and the

Segment 3 (10:00 - 15:00)

average masses. And then I'll use this data set, which I'm calling penguins table, in my ggplot. So ggplot, penguins table. Now I want to put species on the x-axis, average mass on the y-axis, and once again I'll fill it by species. This time I'm going to use geom_col(). Now, if you've done some single-variable plots with R, for instance if you're following along with this course, you might be thinking geom_bar(). There's a really important distinction here that I want to point out right now. Let's look at the help file for geom_col(), and you'll see it's the same help file as for geom_bar(). There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the heights of the bars proportional to the number of cases in each group. So this is fundamentally a single-variable plot as far as ggplot is concerned; you only pass it one aesthetic. geom_col(), on the other hand, requires two aesthetics. This is a two-variable plot as far as R is concerned: we pass it the categorical variable, x, on the one axis, and we also let R know the heights of the bars. In this case you can see I have y equals average mass here. So long story short, if you have one variable, one categorical variable, it's geom_bar(). If you have two variables, one categorical and one quantitative, it's geom_col(). So that's what I'm doing there. I have altered my label slightly to put in the word average, and I've left my other things the same, with the scale_fill_brewer() and the legend.position = "none". Now, in general, again, I recommend being very judicious with this plot, but you'll see them all the time and occasionally you will have cause to make them, so I wanted to get that in here. Another important and really common option when you're trying to plot one categorical versus one quantitative variable is to think in terms of making a single-variable quantitative plot and then using color to distinguish the categories.
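A sketch of the bar chart of group averages described above. The video deliberately does not explain the summary code, so the group_by() and summarize() step here is one plausible way to build the table, not necessarily the author's; the names penguins_table and avg_mass are assumptions based on how they are spoken, and the label and caption wording is assumed too. R 4.5 or later is assumed for the base R penguins data.

```r
library(tidyverse)

# One row per species with its mean body mass.
# na.rm = TRUE skips the penguins whose mass was not recorded.
penguins_table <- penguins |>
  group_by(species) |>
  summarize(avg_mass = mean(body_mass, na.rm = TRUE))

# geom_col() takes two aesthetics: the category (x) and the bar height (y).
# geom_bar() would instead count rows and needs only x.
p_col <- ggplot(penguins_table, aes(x = species, y = avg_mass, fill = species)) +
  geom_col() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    x = "Species",
    y = "Average body mass (g)",                        # label wording assumed
    caption = "Source: Gorman, Williams, and Fraser"   # caption wording assumed
  ) +
  theme(legend.position = "none")

p_col
```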
So, for instance, in this next code chunk, I'm going to make a histogram for this same data set. I'm still going to be looking at the body mass of the penguins, but then I'm going to fill it by species. Let's take a look at that. I'll take out this extra space that I have with my plus and then zoom in on it. A plot like this very much stresses the quantitative variable. You can see the distribution of body masses in this data set very clearly. You can see the peak of the overall distribution, and you can see it tapering out, with fewer counts as the body mass goes up. Species is also being displayed, but it definitely has a secondary role here. That's particularly clear when you have overlap between the different species, like we do right around the 4500 g mark. The chinstrap penguins are represented by just little bits of orange, that burnt orange, placed on top of the gentoos, higher up on the screen, and then the Adélies are placed even further up. So you have to bear in mind that for each of the bars here, as you look up and down, you are seeing the proportions of the penguins in that bar that belong to each species. So long story short, it's stressing the quantitative variable and also displaying the categorical one, but kind of suppressing it. This is often a good option when you have categories that don't overlap so much. A real go-to there can be the density plot grouped by the levels of the categorical variable. This next code chunk uses geom_density() with a transparency level of 50% and a lot of the same other things, the same aesthetics and the same labels. There we go. For the gentoo penguins here on the right, you can see the distribution very clearly. For the Adélies and the chinstraps, there's overlap.
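The two color-coded single-variable plots described here can be sketched as below. This assumes R 4.5 or later; the histogram uses ggplot2's default of 30 bins because the transcript does not say whether a bin count was set, and the label wording and object names are assumptions. The legend is kept in both plots because fill is the only thing identifying the species.

```r
library(tidyverse)

# Stacked histogram: body mass on the x-axis, bars split by species.
p_hist <- ggplot(penguins, aes(x = body_mass, fill = species)) +
  geom_histogram() +                    # default 30 bins; ggplot2 prints a message about this
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Body mass (g)", fill = "Species")

# Overlapping density curves, one per species, 50% transparent.
p_density <- ggplot(penguins, aes(x = body_mass, fill = species)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Body mass (g)", fill = "Species")

p_hist
p_density
```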
While you can see it, you can make it out, it's a little bit less obvious what's going on there. So, this is another plot to consider that shows the distributions for the three different groups, potentially very clearly. Just bear in mind that your reader, your viewer, may lose track of which category is which if there's too much overlap. Now, these last two plots rely on color as the distinguishing variable. So on this plot, if you aren't seeing the differences between the colors clearly, you're going to have a lot of trouble making out which of these curves refers to which species. That's a real thing to keep in mind because of the diversity in the way humans perceive color, and even the way individuals perceive color differently in different situations. We all remember the, what

Segment 4 (15:00 - 20:00)

was it, the gold dress, blue dress thing that happened on Twitter a while back, and all the controversy that caused. That is really a morality tale about the way humans view color. So whenever you're using color as a unique identifying characteristic in your plots, you should definitely bear that in mind. The final option I want to mention when you have one quantitative and one categorical variable is faceting. This means making your single-variable plot but then having one subplot for each of the levels of the categorical variable. I'll execute this code and point it out, and then we'll talk about it, zooming in. So, as advertised, here I have a separate histogram for the Adélie, the chinstrap, and the gentoo penguins. You can see very clearly once again, just as with that density plot, that the gentoo penguins are heavier. This has the benefit of not relying on color as the sole identifier. You can see that the bottom plot is labeled as gentoo very directly. So even if you are fully colorblind and only seeing in black and white, you would still know which penguin is which here. Now, there is a tradeoff. First of all, you're just using more real estate on your screen. You need to make three plots and they're going to take up more space. But I think even more important is the fact that it's harder to compare the counts that we have going up and down here. The differences in body mass are very clear; that's the variable going left and right. But because these plots are stacked one on top of the other, it's not so obvious how the counts directly compare in some of those bars more in the middle. So you want to use this judiciously as well. Let me say a bit about the code here. Let's take a look at that. So, the same aesthetics: put body mass on the x-axis, use fill for color, make a histogram.
You can take out the legend because the different groups are already being labeled. The big difference here is that I've added another layer, this facet_wrap() layer. Within facet_wrap(), I say which categorical variable I want to wrap by. You'll see I have ~species, tilde species, so wrap it by species. And then I use the ncol argument to say I want all of these plots put in a single column. Let's take that out for a second and see what it looks like without it. If you don't specify something like ncol or nrow, R will make a logical choice for you. Here you can see it put everything in one row, which could be good for a lot of purposes. Here I can very clearly see the differences in counts, that overall there are more Adélie penguins. We can see those bars go all the way up to 20, over a similar range of widths, compared to the chinstraps and gentoos, where the counts are lower. On the other hand, it's less clear that the gentoos are heavier than the other two in general. You can see it, but you have to do a little bit of calculation in your mind. You have to look at those x-axes very specifically and think it through. So, while each of the plots we've seen in this video shows the same information, they definitely have different uses. It's worth thinking about what specific point you're trying to get across when you're making your plot, and using that as a signpost when you're deciding which of these plots to use. Now, as I said at the very beginning of this video, I'm going to render this document and upload the Quarto file to my website. I'll put a link down in the description. So before we wrap up, let's render this and see what the HTML document actually looks like at the end. And I'll explain a couple of the choices I made as I created this Quarto document. So here it is.
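The faceted histogram and the effect of ncol can be sketched as follows. This assumes R 4.5 or later; the default bin count and the label wording are assumptions, and the object names are illustrative.

```r
library(tidyverse)

# Shared layers: histogram of body mass, filled by species, no legend
# because each panel is already labeled with its species.
p_base <- ggplot(penguins, aes(x = body_mass, fill = species)) +
  geom_histogram() +                    # default 30 bins
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Body mass (g)") +
  theme(legend.position = "none")

# One panel per species, stacked in a single column:
# easy to compare body mass (left to right), harder to compare counts.
p_stacked <- p_base + facet_wrap(~species, ncol = 1)

# Without ncol or nrow, ggplot2 picks the layout itself.
p_default <- p_base + facet_wrap(~species)

p_stacked
p_default
```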
You can see my setup chunk with the library(tidyverse) and the theme_set(theme_minimal()). I want to point out the code chunk options that I put in here with this hash and then the vertical bar, the so-called hash pipe. I put in a chunk label saying that it was setup, and that doesn't appear in my rendered version. And I used message: false, as I mentioned earlier, so I don't get the long tidyverse printout. Similarly, I've suppressed the warnings that would come with all of these plots, because there are two observations in this set that have a lot of NAs. You may have noticed, when I executed some of these commands before, my geom_boxplot() for instance, the warning message that two values were removed. I decided I didn't want this printing out throughout. I could have suppressed it individually with a chunk option here, warning: false, and that would have done it in my rendered document for this one chunk. But I decided I didn't want to have to do that in every one. So in my YAML header here

Segment 5 (20:00 - 20:00)

I added an execution option. The execute option in your Quarto document says to apply this as a chunk option to every chunk in the Quarto document. So warning: false got passed to everything. Okay. So if you want to go through that Quarto document in a little more detail, please check out my website.
