Tidying data with pivot_wider
15:54

Tidying data with pivot_wider

Equitable Equations 18.05.2026 287 просмотров 25 лайков

Machine-readable: Markdown · JSON API · Site index

Поделиться Telegram VK Бот
Транскрипт Скачать .md
Анализ с AI
Описание видео
What can you do if a data set has more than one row per observation? Reshape it, of course! If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more #rstats joy, crush that 'subscribe' button! You can find content supporting this vid on my website (https://equitableequations.com/posts/2025-11-10/), which you should totally check out anyway!

Оглавление (4 сегментов)

Segment 1 (00:00 - 05:00)

Hey everybody. Today we're talking about reshaping data that comes in a two long format. This is a situation where we have multiple rows per observation and where some of the cell values in our spreadsheet are actually names of a variable. Don't worry if that sounds abstract. I'll show you an example in just [clears throat] a moment. So pretty much all of our most important R tools for visualization, wrangling, and modeling assume that data is in a tidy format. That is to say, situation where every row in a data set represents one and only one observation and every column variable. And sort of our tidy verse tidy models philosophy is that if we have a data set that's not in that format, we should start by reshaping it, getting it into a standardized format that we can then work with our standardized tools. So let's take a look at this blocks data set that I'm going to be working with throughout this vid. This comes from the GLMs data package, which includes a lot of fun data sets. These are data sets supporting the excellent book Generalized Linear Models with Examples in R by Dunn and Smith or possibly Smythe. Hey Mr. Smythe, if you're out there, let me know. Then I'll pronounce your name correctly from now on. This is a pretty mathy introduction to Generalized Linear Models. If that's your cup of tea, I recommend this one. As it stands right now, the blocks data set has 100 rows. What we're seeing here are observations from a block stacking experiment back in 1931. There are 25 children labeled A through Y and for each one of those 25 children, they have stacked cubes and they have stacked cylinders in separate trials. And then for each one of those two shapes, they have done two separate trials. So, trial one, trial two. So, 25 children * 2 shapes * 2 trials each is 100 observations. So, for a lot of purposes, this would constitute tidy data. One observation is a child stacking one kind of block in one of the trials. However, there are certainly plenty of applications where we might want to consider an observation to be just one child, or potentially one child with one particular type of block. In that case, we'll want to reshape the data. So, I'm going to start by trying to make this data set um have observations be individual children. So, I'm expecting something with 25 rows, one per child, and then I'll have more columns because if I'm having fewer rows with the same information, I certainly expect more columns. So, I'm going to need one time column and one number column for each trial and for each shape. In other words, I'm going to need columns like trial one cube number, trial two cube number, things like that. The primary tool we're going to use is pivot wider, duh, name of the video and all. In order to make use of this, to deal with the situation where we have more than one row per observation, we're going to need to specify three things. First of all, we're going to have to let pivot wider know which column or columns uniquely identify our observations. So, in this case, child is going to uniquely identify our observations. Next, we sort of need to say what the problem columns are. We're going to say which columns have cells that currently contain variable names. So, in situation where we want each row to be a child, that means that trial number and shape are both actually containing things that should go into our column names. Remember, we're going to want things like trial one time cube or, you know, trial one cube time, things like that. Finally, we need to say which columns currently contain the corresponding values of those variables. And so, in this case, we're going to be looking at what time and number stacked. Notice that age is going to be the same for each child already, even across the different trial trials and the different shapes. All right. So, when we're using the pivot wider command, we're essentially specifying these three things in order. So, here's the code I'm going to use to get to a data set that has one row per child. I'm going to start by piping the blocks data set into pivot wider, so that'll pass it as the first argument in pivot wider. Then, the IDs call are ID calls argument

Segment 2 (05:00 - 10:00)

is going to specify which columns are identifying unique observations. So, in this situation, child is going to identify unique observation, child A, B, C, et cetera. The columns that currently have values that should be parts of variable names, in this case, are trial and shape. And so, I pass these to pivot wider as a vector. So, C {parenthesis} trial {comma} shape. And then, finally, values from is saying which columns have the values for the variables that we've just specified. So, number and time. By the way, these argument names from and values from parallel and reflect argument names in the pivot longer command where we have names two and values two. Pivot wider and pivot longer are, roughly speaking, inverse operations and you can undo one with the other under most circumstances. So, let's actually execute this code and I'll just take a quick glimpse to see what we got. We now have 25 observations, 25 rows, which is exactly what we would hope for since we have 25 unique children in this data set. And now, look at the column names that we have. Number_1_cube, number_1_s- um number_1_cylinder, etc. So, what we're seeing here for the one is trial number one. So, the number of cubes stacked on trial number one by each child, the number of cylinders stacked on trial number one for each child, and so on. Obviously, the names aren't perfect and we could uh do a little bit of extra work to put in the word trial here. We can do that either with a rename command or, as we'll see in a moment, within the pivot wider call. Let's do one more um example with the blocks data set before moving on to some of those more specifics. Now, I want to reshape the original blocks data set so that observations are defined to be a single child stacking a single shape of blocks. So, trial is no longer going to be an identifying variable in this situation. So, I won't have quite as many columns as in the blocks kid example that I did a moment ago, but I will have more rows. I'm now expecting 50 rows because I am going to have one for each child for each shape of block. So, child A will have a cube and a cylinder row, but then there will be two columns for each trial. So, I'll have something like number trial one, number trial two, etc. So, now I have two ID columns. I am going to be specifying unique observations using not just child, but also shape. A unique observation is going to be a child stacking a certain shape of block. Now, only one column can includes names of a new variable, and that's going to be trial. I want the trial number to be incorporated into the names of new columns that I'm going to create. And then the values in those columns are going to come from the number and the times column in the current blocks data set. So, let's run this and see what it looks like now, see if we got what I promised you. As I said, you get 50 rows here now with six columns. You see we have child ABC, etc. Each of those will appear in the data set twice. We also have a shape column that's going to include cube and cylinder, but now as promised, we have columns, two columns for number and two columns for time. The number the child stacked in trial one, two. Maybe I'll just show this maybe I'll just pull this up in the viewer very quickly so that we can see the um the way that we have each child exactly twice. So, you see we have all the cubes first, ABCD, etc. That's 25 of them. And then we have another 25 rows, same children, but now for the cylinders. So, you can see we got exactly what we wanted here. There are a lot of optional arguments that can be useful in pivot wider. As with most tidyverse functions, most of them are oriented around making sure that the column names that you get out of the pivot wider command are actually readable and useful to you in moving forward. Let me just pull up the help file to show you some of that for pivot wider. There we go. Yeah, there we go. So, things like names prefix, names sep and so on.

Segment 3 (10:00 - 15:00)

As I'm looking at this blocks _kid shape data set that I just created, I've decided I want to make my column names a little bit more clear that these one and two refer to trials. So, I'm going to use names sep to replace the default _separator with something a little bit more descriptive. I'm going to put in the word trial there. Let's see what we get there to take just to illustrate this. So, instead of number_1, I now have number_trial_1. So, you can see how pivot wider has now used the name sep argument that I've provided as part of its formula for creating column names. It has started by using number or time, then my separator, and then the thing from my names from argument to give me more or less what I'm looking for. Now, one natural question that comes to mind as you're working through all of this is why? Why can't I just use the original data set in its nice long format for all of my applications? And there just sometimes where that might not be possible or just might not be convenient. And one example that I like to show is a dumbbell plot. I want to make a dumbbell plot that's going to compare how kids did stacking the cylindrical blocks in their first versus their second trial. And if I'm going to want to make a GG plot with the um of blocks that they the kids have stacked in the first trial versus the number that they stacked in the second trial. I'm going to need two columns for that if I'm going to make a dumbbell plot. So, first that maybe I'll show the plot and then um talk about how I'm actually getting it with pivot wider code or how I'm getting it with a ggplot code rather. Okay. So, here's what I mean by a dumbbell plot. I have one row per child. I'm using I'm going to use the uh the blocks kid shape data set that I got in my second pivot wider a moment ago, but I'm immediately filtering it to only keep the cylinders so that I really do only have 25 observations now. The big thing that um motivated me to use a pivot wider here is the fact that I want to do a dumbbell plot with geom_segment. I want to connect the number that they stacked in their first attempt versus the number that they attacked they stacked in their second attempt. And to do that um with geom_segment, I'm going to need to specify four aesthetics. I need to specify the X and the Y coordinates for where these segments start and end. So, you can see how I've done that. I have taken my Y values to be defined by the child both for the start and the end. And then for my X values, I have put the starting value to be that the number that they stacked in trial one, the ending value to be the number they have stacked in trial two. So, this part right here is going to give me just those segments. In fact, let me just illustrate that. I'll take out my plus sign and you can just see the segments. So, there's the sort of raw plot. Now, obviously that doesn't look very good yet, so I'm going to put the plus back in, make it look a little better, talk about some of the code that I used to do that. So, um the first thing that I did was to add points onto this. So, I didn't just want the segments, I wanted to specifically label where those segments start and end and have those points to be labeled with colors corresponding to trial one and trial two. So, I have two separate geom_point calls. In the first one, I'm specifying X and Y aesthetics using the number that the children have stacked in trial one. And for the second geom_point call, I'm doing the same with trial two. Notice that I've put color as an aesthetic. Even though I have two different geoms here, I am, um, going to want a legend that is going to allow the reader to connect the two. And so, by putting color as an aesthetic, I'm letting ggplot know that it, I'm going to want a legend that way. It will assign colors automatically, so I'm also going to want a scale_color_brewer to, uh, get a slightly more accessible color palette. I like the Dark2 palette. It's high contrast and generally scores pretty well in terms of being color blind friendly. And then finally, I've neatened up my labels, taken out, uh, the label that would go on my legend to say like trial number or something cuz that's already contained in my labels. Um, I've also put in a little nod to a Homest, old Homest, Homestar Runner cartoon, stack 'em to the heavens. One last comment as I wrap up here. Although pivot_wider goes naturally alongside pivot_longer both because of the name and in terms of what they do

Segment 4 (15:00 - 15:00)

the two functions really do not have equal importance. pivot_longer comes up all the time and is really an essential tool in your toolkit even if you're just sort of vibe coding and reviewing code that's coming out from like a Claude or a GPT or whatever. pivot_wider is a bit more niche. Doesn't come up quite so often. It's something you need to have some familiar familiarity with, particularly as you are identifying tidy data and trying to diagnose what might be wrong with the data set so that you can either code on your own or give good prompts to your LLM assistant. But, pivot_wider doesn't come up nearly so often. It's not one you need to have the same deep knowledge of compared to pivot_longer. —

Другие видео автора — Equitable Equations

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Дайджест Экстрактов

Лучшие методички за неделю — каждый понедельник