Understand the grammar of graphics

Segment 1 (00:00 - 05:00)

Hey everybody. Today we're getting started with the grammar of graphics. There's not going to be a lot of code in this vid. Instead, we're thinking about data visualizations a bit more structurally, asking: how do we describe a data visualization? Or, to say it differently, if we're sitting down with some graphing software, what information do we need to specify in order to get the plot that we're looking for?

A relatively new way of doing this is the grammar of graphics, which was first introduced by Leland Wilkinson in 2001 and really became popular with the advent of Hadley Wickham's ggplot2 R package in 2007. The fundamental insight here is that the most basic element of any data visualization is the variables that are being represented and the dimensions that are used to represent them. So, for instance, we might plot one variable on the x-axis, another on the y-axis, another using color, another using the shape of points, etc. The exact sort of plot that we're making, the sort of geometry that we're using to represent the data, is actually a secondary feature here and should be specified after the variables and the dimensions used to represent them. Finally, stylistic elements can be specified separately. These are design considerations that aren't defined by the data directly: things like axis labels, fonts, color palettes, etc.

Here's a motivating example to illustrate the primacy of the variables and the dimensions used to represent them in our plots. These are two plots based on the diamonds data set that comes with the ggplot2 package in R that I mentioned a minute ago. We're seeing the carats of a bunch of diamonds. You can see that in each plot I have carat on the x-axis and count on the y-axis, count being literally the number of observations of diamonds that fall into each of the bins on that x-axis.
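The video doesn't show the code for these two plots, so here is a minimal sketch of how they might be built with ggplot2. The bin width of 0.1 and the object names are assumptions chosen for illustration, not values taken from the slides.

```r
library(ggplot2)

# Shared starting point: the data set plus one aesthetic mapping
# (carat on the x-axis). The count on the y-axis is computed by binning.
base <- ggplot(diamonds, aes(x = carat))

# binwidth = 0.1 is an arbitrary illustrative choice, not from the video
hist_plot <- base + geom_histogram(binwidth = 0.1)  # histogram
poly_plot <- base + geom_freqpoly(binwidth = 0.1)   # frequency polygon

print(hist_plot)
print(poly_plot)
```

The only thing that changes between the two plots is the geometry added at the end; the data set and the mapping stay the same.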
And it's not a coincidence that these two plots look very similar, despite the fact that one is a histogram and one is a frequency polygon. They have exactly the same variables in exactly the same places. If we were designing a graphing system, we would love to be able to switch from one to the other easily. We would even love to be able to superimpose one on top of the other easily. Again, the philosophy is that it's the variables and their positions that are primary, not the exact sort of plot.

Here's another example using the crickets data set, which comes from the modeldata package in R. Here we're representing temperature on the x-axis, chirp rate on the y-axis, and species using color. So we have a bunch of crickets from two different species. They're being exposed to different temperatures, and their chirp rates are being recorded. Some of the non-data aspects of this plot include the choice of color palette (I have sort of an orange and a teal here) and the use of x's to represent points in my data set rather than, say, dots or boxes. The position of the legend there on the right would be another non-data aspect of this plot.

The grammar of graphics is particularly popular among R users because of that ggplot2 package and its workhorse ggplot function. I'm not going to talk about the code used to make that plot on the last slide really at all, except to point out that the very first thing that's specified after the name of the data set is the variables and where they are being positioned. That's the aes(x = temp, y = rate, color = species). So there you can see the variables being used and the dimensions used to represent them.

Now, there are lots of different ways you can represent variables. Roughly in order of how common they are, here are some of the most common dimensions we use to represent variables.
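Before that list, a quick sketch of the crickets plot described above may help. Only the `aes()` call is quoted in the video; the rest is an assumption: the use of `geom_point()`, the two color values standing in for "orange and teal", and the explicit legend position. It also assumes the modeldata package is installed.

```r
library(ggplot2)

# Assumes the modeldata package is installed; it supplies `crickets`
# with the columns `temp`, `rate`, and `species`.
data(crickets, package = "modeldata")

cricket_plot <- ggplot(crickets, aes(x = temp, y = rate, color = species)) +
  # shape = 4 draws each point as an x. It is set outside aes(),
  # so it is a style choice rather than an aesthetic mapping.
  geom_point(shape = 4) +
  # Hypothetical palette approximating the orange and teal on the slide
  scale_color_manual(values = c("darkorange", "#008080")) +
  # Legend on the right (also ggplot2's default position)
  theme(legend.position = "right")

print(cricket_plot)
```

Everything inside `aes()` comes from the data set. The shape, the palette, and the legend position are the non-data aspects.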
The x- and y-axes are the obvious ones. Then there are boundary and interior colors, and those are potentially different things. For instance, if we have a box plot or a histogram, we might want to color the interior or the boundary of those boxes, and one or the other or both might be appropriate in different circumstances. Size can be used to represent a variable in a data set. Most often, sample size, the number of observations of a certain sort, might feed into the size of points on a scatter plot. Then there's shape, line type, and finally opacity: how transparent the points or the boxes in our plot are.

In R, we refer to these as aesthetics or aesthetic mappings, mapping referring to the fact that we are taking a variable in the data set and mapping it to a dimension on the plot. Now, one thing you have to be careful with when you're talking about the grammar of graphics, especially when you're thinking about aesthetics like color and shape: aesthetics describe how variables in a data set are represented. The part I really want to underline there is variables in a data set. When a

Segment 2 (05:00 - 10:00)

property of a graph like color or shape isn't determined by a variable, it's not an aesthetic, but rather a stylistic choice. So in that scatter plot we saw a second ago, color was being used to specify what species of cricket we had. In contrast, we could have made all of those x's blue, just specifying blue as a global option that's not coming from the data set in any way.

All right, so let's do some examples. I'm going to give you a few different plots. I recommend pausing the video before each one and thinking about what the aesthetic mappings are. In other words, what variables are being represented, and what dimensions are being used to represent them? Then try to name some non-data aspects of the plot, some of the design choices that have been made that aren't coming directly from the data set.

Let's start off with this histogram, which is based on the faithful data set, which, like most of the data sets I'm using today, comes built in with R. In this case, we have a single-variable plot, the single variable being the waiting time between eruptions of that geyser. That's being plotted on the x-axis. On the y-axis, we have count. That's not a variable in this data set. Rather, it's just literally the number of observations, the number of eruptions that had waiting times in the specified bin. There are any number of non-data aspects of this plot. You can see the title, the font, the light blue that I've used on the inside, and the dark boundary of the bars. I've also put a little source note there on the lower right.

Next up, let's talk about cars. This plot's derived from the NT cars data set, another R data set. It's showing the prices of various cars back in 2005. On this plot, you can see we have a couple of different variables being represented. The price of the cars is on the y-axis and the number of cylinders is on the x-axis. Here we're treating the number of cylinders (4, 6, and 8) as a categorical variable.
Now, in this case the number of cylinders is being represented a second way as well, with the colors. Each of the different categories has a different color. This illustrates the fact that a single variable in your data set can be mapped to multiple dimensions on your plot. The relationship doesn't have to be one-to-one.

All right. In addition to those two variables and the three aesthetics that are being used to represent them, we have to talk about the geometries. There are actually two of those here as well. First is the box plot, which is representing the five-number summary for the price variable in each of these three categories, giving us summary information about price in each of those groups. In addition to that, though, we've overlaid a jitter plot. Each one of those dots is representing a single observation in the data set, a single car. So this illustrates the fact that the grammar of graphics can be layered. Once we've specified our aesthetics, we can specify more than one geometry, and that's often a desirable thing.

Here's another example in a similar vein. We've already seen this crickets scatter plot. In fact, this is exactly the same scatter plot that I showed in my slides earlier. The only difference is that now I've put regression lines on top. To summarize the aesthetics: temperature is on the x-axis, chirp rate is on the y-axis, and species is being represented with color. But now, in addition to the jitter geom, the scatter plot with a little bit of noise introduced, I've also put in regression lines. That's layering again, on top of that scatter plot, but using the same aesthetics as before.

Here's another one to try. This is based on the Scooby-Doo data set. I'll make sure to have a link to that: I'll put it up on my website, and there's a link down below in the description.
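Before turning to the Scooby-Doo plot, the two layered plots just described can be sketched roughly as follows. The cars data from the slide isn't reproduced here, so the first plot uses ggplot2's built-in `mpg` data as a stand-in, with highway mileage `hwy` in place of price; that substitution, the jitter width, and the choice of `geom_point()` for the scatter layer are assumptions. The second plot assumes the modeldata package is installed.

```r
library(ggplot2)

# Stand-in for the cars plot: `mpg` ships with ggplot2. Cylinder count is
# mapped to two aesthetics at once (x position and color).
box_jitter <- ggplot(mpg, aes(x = factor(cyl), y = hwy, color = factor(cyl))) +
  # Layer 1: box plot; outliers hidden because layer 2 draws every point
  geom_boxplot(outlier.shape = NA) +
  # Layer 2: one jittered dot per car (horizontal noise only)
  geom_jitter(width = 0.2, height = 0)

# Crickets scatter plot with one regression line per species.
data(crickets, package = "modeldata")
scatter_lm <- ggplot(crickets, aes(x = temp, y = rate, color = species)) +
  geom_point() +
  # Straight-line fits, no confidence bands
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(box_jitter)
print(scatter_lm)
```

In both cases the aesthetics are declared once in `ggplot()`, and each geometry added afterwards inherits them.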
So, take a second and try to figure out the aesthetics here, as we look at the number of times Shaggy, a character in that show, says the word "zoinks" in different episodes over the years. In this case, we have a few different aesthetics. We have the year going left and right. We have the number of zoinks per episode, the number of times Shaggy says zoinks in each episode. These are averages based on entire seasons of the show Scooby-Doo. And when we're representing averages, we'd like to say what the sample size is for each one. That's being done using the size of the points. So our third aesthetic here is the number of episodes in a given year. You can see that the points go from relatively small at around 10 episodes per year up to fairly large at 40 episodes per year. So that's

Segment 3 (10:00 - 11:00)

a third aesthetic here. Some of the non-data aspects of the plot might include the fact that we've used circles instead of squares or something else, the title, the fact that I put a little box around the whole plot, and the way I've labeled the x-axis, for instance, going by every 10 years rather than, say, every five.

One last example. Take a second and think about the aesthetic mappings here, as well as what geometry is being used. In this case, I see network on the x-axis. Again, we're talking about Scooby-Doo here, so that's what network was airing the episodes. We have monster real being represented with the interior color of these bars. In R, that's called the fill aesthetic. On the y-axis, we just have number of episodes. That's the number of rows in the data set, the number of episodes of Scooby-Doo that actually fell into each of these categories, with each network and with each monster real status. So in this case, we have two variables and two aesthetics. There's a single geometry here, and that is the fact that we're making a bar chart. We're representing our data using the interior of bars.
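As a rough sketch of that last plot: the Scooby-Doo data isn't bundled with R, so the data frame below is made up purely for illustration. The column names `network` and `monster_real`, the network labels, and the rows themselves are all hypothetical.

```r
library(ggplot2)

# Made-up rows for illustration only; this is NOT the real Scooby-Doo data.
# One row per episode, with hypothetical column names.
episodes <- data.frame(
  network      = c("Network A", "Network A", "Network A",
                   "Network B", "Network B", "Network C"),
  monster_real = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Two variables, two aesthetics (x and fill), one geometry (bars).
# geom_bar() counts the rows in each group to get the bar heights.
bar_plot <- ggplot(episodes, aes(x = network, fill = monster_real)) +
  geom_bar()

print(bar_plot)
```

Because `fill` sits inside `aes()`, the interior color is driven by the data. Writing something like `geom_bar(fill = "lightblue")` instead would make it a fixed style choice.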

Author: Equitable Equations
