Grouped filters and mutates in R

Equitable Equations 07.04.2026 497 views 36 likes

Video description

The dplyr package allows many important operations to be done group-wise, drastically speeding up otherwise repetitive tasks. Let's see how it works! If this vid helps you, please help me a tiny bit by mashing that 'like' button. For more #rstats joy, crush that 'subscribe' button! You can find material supporting this course on my website: https://equitableequations.com/posts/2025-11-10/

Contents (2 segments)

Segment 1 (00:00 - 05:00)

Hey everybody. Today we're talking about grouped filters and mutates in R. Basically, we're in a situation where we want to subset rows or add or modify columns in a group-wise fashion. We're going to be working throughout this video with the diamonds data set, which loads up with tidyverse. So, let me go ahead and execute library(tidyverse) right now. This set includes about 54,000 observations of round-cut diamonds and it includes a lot of different variables. I'm going to start by keeping only a few of them, just for better printouts. I'm going to keep carat, cut, and price. So, sort of three of the five C's. With a quick glimpse, you can see those 54,000 observations and those three columns.

Now, both filter and mutate include an optional by argument. Well, .by. And this directs R to work in each group separately. So, you can kind of imagine separating this data set into ideal cut diamonds, premium diamonds, good cut diamonds, et cetera, taking one set for each level, then executing the filter or mutate separately on each, then reassembling everything. That's kind of what R is doing under the hood here.

For instance, let's add a column to this diamonds small data set that's going to track the average diamond price for each level of the cut variable. Here's the code for that. It's a mutate, just like you would expect, since we're adding a column. The new column is going to be called mean price by cut, and you'll see I'm just assigning it to be the average price. But because I have the .by = cut argument, the column that R adds is not just going to have the average price for all diamonds in every row. Rather, it's going to have the average price for the respective group that the particular diamond is in. So, let's run this. I have added head to the end so we'll just get the first few rows. There we can see I've got a printout which shows three of the four columns. So, let me stretch this out a little bit until you can see that.
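The code on screen is not part of the transcript, so here is a minimal sketch of what the grouped mutate described above might look like. The object and column names diamonds_small and mean_price_by_cut are assumptions inferred from the spoken "diamonds small" and "mean price by cut". The .by argument needs dplyr 1.1.0 or later, and the native pipe needs R 4.1 or later.

```r
library(tidyverse)  # loads dplyr and ggplot2, which ships the diamonds data set

# Keep only three columns for compact printouts.
# The name diamonds_small is an assumption based on the narration.
diamonds_small <- diamonds |>
  select(carat, cut, price)

glimpse(diamonds_small)

# Add a column holding the mean price of each diamond's own cut level.
# .by = cut makes mean(price) be computed separately within each cut.
# The column name mean_price_by_cut is likewise assumed.
with_means <- diamonds_small |>
  mutate(mean_price_by_cut = mean(price), .by = cut)

# Show only the first few rows
head(with_means)
```

Rows that share a cut level share the same value in the new column, and the row order of the original data frame is unchanged.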
At the expense of my source pane there. All right. So, take a look at this. You can see I have all of my different diamonds. So, this row one is the same row one that we had in the diamonds small data set. We have the same columns except for this new one, mean price by cut. And you'll see that some of these values are the same, like row three and row five. Those are both good cut diamonds. And row four and row two, those are both premium cut diamonds. And some of them obviously are different. So, mean price by cut is giving the average price of the diamonds in that level of the cut variable. All right, let me widen that out a little bit.

Similarly, we can keep diamonds that are larger than their group means and exclude everything else. And we can do that without first doing a mutate. We'll use the filter command directly with a .by argument. So, the code here starting on line 41 takes the diamonds small data set and retains only rows where the carat of a particular diamond is larger than the mean carat of all the diamonds in its group. And the reason we're talking about groups is because, again, we have the .by argument. It's .by = cut, so that mean is going to be calculated group-wise. I'm glimpsing this so that I can see the number of rows that have been kept. In this case, it's 23,539. That's about half of the original data set, just as we would expect.

Now, the older group_by syntax works equally well. The output is pretty much exactly the same, although with a few small differences. Let me show you. In this next chunk, I'm taking that same diamonds small data set that we've been working with and using group_by to group it by this variable cut. At that point, I'm imagining something like an Excel spreadsheet where each level of cut has a different highlighter color. And then once that's done, I'm doing the filter using similar syntax as previously. Notice there's no .by argument anymore. That's handled by the group_by function here.
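As a rough sketch of the two filters just described (the on-screen code is not in the transcript, so the object names diamonds_small, larger_by, and larger_grouped are assumptions), the .by version and the group_by version might look like this:

```r
library(tidyverse)

# Same three-column subset as before (name assumed from the narration)
diamonds_small <- diamonds |>
  select(carat, cut, price)

# .by version: keep diamonds heavier than the mean carat of their own cut.
# The grouping applies to this one filter call only.
larger_by <- diamonds_small |>
  filter(carat > mean(carat), .by = cut)

# group_by version: the same condition, but the grouping is set up first
# and stays attached to the result.
larger_grouped <- diamonds_small |>
  group_by(cut) |>
  filter(carat > mean(carat))

glimpse(larger_by)
glimpse(larger_grouped)

is_grouped_df(larger_by)       # FALSE: .by does not leave groups behind
is_grouped_df(larger_grouped)  # TRUE: still grouped by cut
```

If the grouped result needs to be passed on without its groups, adding ungroup() at the end of the second pipeline removes them.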
Let's glimpse that again. You'll see that the output is pretty much identical. We have the same number of rows, the same columns, the same first few observations. But notice that this output has a grouping attribute. So, unlike .by, where the grouping is thrown away after the filter or mutate or whatever is done, group_by doesn't do that by default. So, there are times when you might want a grouped data frame out, but usually not. In general, I recommend using the shorter .by syntax. You know, first of all for brevity, but in particular because of that behavior with regard to

Segment 2 (05:00 - 07:00)

maintaining groups or ungrouping. Usually, you want the ungrouped option. I reserve group_by, the longer syntax, for situations where a little extra clarity is needed and I really want to be explicit about the grouping, or where I do need the groups to persist after I'm done with my filter, mutate, or summarize function. If you want to get a little bit more into the distinctions between .by and group_by, I'm going to refer you to my video on grouped summaries in R. I'll throw a link to that up top.

The last data wrangling task I want to show you is how to identify the top observations in a group-wise fashion. For instance, we might want to know the two most expensive diamonds for each level of the cut variable. We accomplish this with slice_max, which takes the subset of a data set for which a specific variable is the largest. By adding a grouping variable to this, we can do that within each level of, for instance, our cut variable.

Now, one of the little weirdnesses of this new .by syntax is that for some of the dplyr verbs, it's actually by, not .by. The big ones like summarize, filter, and mutate use .by. slice_max uses by. I don't have a good answer as to why they made that distinction. The help files and documentation say that it's a strictly technical difference, so it's not surprising that I haven't had to interact with it in any way.

So, let's see the output of that. You can see that the result is a data frame with the same number of columns, three. And now we have 10 observations, which makes sense. We asked to keep two observations from each group and there are five groups. Note the syntax. We're starting with the diamonds data set and taking the max of the price variable. We want n = 2 observations, and we want two observations in each group, where the group is defined by cut.
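A minimal sketch of the slice_max call described above, assuming the three-column diamonds_small subset from earlier in the video (that name and top_two are assumptions, since the on-screen code is not in the transcript):

```r
library(tidyverse)

# Three-column subset used throughout (name assumed from the narration)
diamonds_small <- diamonds |>
  select(carat, cut, price)

# The two most expensive diamonds within each level of cut.
# Note the argument is by here, not .by as in filter() and mutate().
top_two <- diamonds_small |>
  slice_max(price, n = 2, by = cut)

top_two
```

One thing to be aware of: slice_max keeps ties by default (with_ties = TRUE), so a group can return more than n rows when several diamonds share the cutoff price. Passing with_ties = FALSE caps each group at exactly n rows.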
