# Python Pandas Tutorial (Part 8): Grouping and Aggregating - Analyzing and Exploring Your Data

## Метаданные

- **Канал:** Corey Schafer
- **YouTube:** https://www.youtube.com/watch?v=txMdrV1Ut64
- **Дата:** 14.02.2020
- **Длительность:** 49:05
- **Просмотры:** 473,687
- **Источник:** https://ekstraktznaniy.ru/video/11765

## Описание

In this video, we will be learning how to group and aggregate our data.

This video is sponsored by Brilliant. Go to https://brilliant.org/cms to sign up for free. Be one of the first 200 people to sign up with this link and get 20% off your premium subscription.

In this Python Programming video, we will be learning how to group and aggregate our data. This will allow us to explore our data in ways we have not yet done in this series. We will be able to answer questions such as: "What is the most popular social media site for each country?" We will be using the groupby method, and also some aggregate functions such as mean, median, value_counts, etc. Let's get started...

Video Timestamps:
Aggregate Column - 2:00
Aggregate DataFrame - 3:55
Value Counts - 7:51
Grouping - 12:30
Multiple Aggregates on Group - 26:00
People Who Know Python By Country - 27:20
Practice Question - 34:20
Concat Series - 37:27

The code for this video can be found at:
http://bit.ly/Pandas-08

StackOverflow Surv

## Транскрипт

### <Untitled Chapter 1> []

hey there how's it going everybody in this video we're going to be learning how we can group and aggregate our data now if you don't know what grouping and aggregating really entails then I'd really recommend sticking around for this video because basically this is what most people think of when they think of actually analyzing data in a meaningful sense so this will be the first video where we actually get some statistics back on our data sets and aren't just modifying our data frames in different ways so for example maybe you want to know what the average salary for a developer is or maybe you want to know how many people from each country knows Python or another programming language so what we're going to learn here is going to allow us to answer those types of questions now I would like to mention that we do have a sponsor for this series of videos and that is brilliant so I really want to thank brilliant if we're sponsoring this series and it would be great if you all could check them out using the link in the description section below and support the sponsors and I'll talk more about their services in just a bit so with that said let's go ahead and get started ok so before we start doing some more advanced data analysis let's start off slow and build up to the more advanced stuff so that all of this makes sense along the way so I have my developer survey data open here that we've been using throughout this series and as usual if you'd like to follow along then I have links to this code and the data in the description section below so let's look at some basic aggregations so if you don't know what aggregation means basically it means that we're going to be combining multiple pieces of data into a single result so for example if you've ever used a mean median or mode in mathematics these are aggregate functions because they take multiple values and give you either the mean median or mode of those results so if we wanted to run some analysis on our developer survey here one question we might ask is okay what is a typical salary for developers who answered this survey so that might be some good information to have if you're looking for a job and want to get an idea of what these salaries look like at the moment so to do this we can grab the median salaries of our data frame so

### Aggregate Column [2:00]

first let's look at these salaries so our salary column within this data frame here of all these survey results is called converted comp and that is converted to u. s. dollars it's actually further over here in the survey it is about right here so I'm gonna copy that now first let's just look at this column so as we've seen before we can just access the column just like we're accessing a key of a dictionary and I'm gonna grab the first let's get the first 15 salaries or so I'm gonna look at the head of the results here and these are salaries here that developers put down for this survey and these na n values here just mean not a number in this context it means that they just skipped that question in the survey okay so we can see the median salary for this survey just by running the median method on this series so to do this I'm going to go ahead and copy what I have here and now instead of looking at the head I can just run median on that series so if I run this then we can see that the median salary for this survey was around fifty seven thousand dollars so that takes all of the salaries responses from our survey from this series here and it gives us the median value of all of those and ignores the na n values now this probably doesn't give us as much information as we'd really like to have so for example different countries pay different amounts since there are different cost-of-living and things like that so it'd be nice if we could look at the median salary broken down by country and we'll look at that here in a second when we learn about grouping data but first I want to cover a few more basic concepts before we move on to grouping so one thing that I'd like to look at is

### Aggregate DataFrame [3:55]

running these aggregate functions on our entire data frame so let's see what we get if we just run this median function that we just ran on our entire data frame instead of just this single series so here I'm just going to say D F dot median so we're no longer accessing just a single column so if I run this then it might take a second to spin up here so when I do this it's going to look through our data frame and find the columns that contain numerical values where it can grab a median you and some of these might not make sense to use with the median but others might be pretty useful to us so for example we can see that the median age down here at the bottom for this survey was 29 years old and the median number of work hours per week that was 40 which is pretty standard so that makes sense now if you want to get a broad overview of your data and a statistical overview we can use the describe method on our data frame instead so if I instead run describe instead of median and I run this then this is going to give us a broad overview of some different stats so if we look at the converted comp column here then we can see a few different stats about this column so it gives us the count mean it gives us the standard deviation the minimum and then it also gives us the 25 50 and 75 percent quantiles here now this 50 percent marker is just the median value by the way and just like we saw before when we look this median value up specifically this is around 57,000 now this is in scientific notation here so it looks a little bit different basically this means that we just need to move four spots over from the decimal point so one two three four so that would be 57,000 there so this described method gives us a bunch of these aggregates in one place if we just want to get a quick overview of our data now if you're wondering why I wanted to look at the median of our salaries instead of the mean which is the average basically it's because the mean is affected too heavily by outliers it's not really a good metric to use because a few outliers can affect the average very heavily we can see that the mean salary up here if I highlight this right here if we were to count this up then that's actually about a hundred and twenty-seven thousand dollars on average but that gives us an unrealistic expectation of what a typical developer salary is because the largest salaries in our data set are just pulling up that average so heavily so in cases like that definitely want to use the mean instead I think that's a better representation or I'm sorry you're going to want to use the median instead because that's a better representation now if we only wanted to get this overview for a single column then we could just run this describe method on a single column as well and get those results for that specific series now you might be wondering what that count value is listed at the top of these described results now the count value is the number of non na rows which basically means that it counts the non missing rows so in the context of this survey and missing row just means that the respondent didn't answer that a specific question so if I look at the count for the converted comp column so I'm gonna go up here and grab this and instead of grabbing the median I'm just going to grab the count we can see here that only about 55 the 65 or 55 to 56 thousand people answered that question now I think there are about 89 thousand rows for this data so that means that there are about thirty thousand people or so who didn't answer the salary question on this survey now I sometimes see the

### Value Counts [7:51]

mistake that some people think that the count function will count up the individual values in a specific row and report how many of those values were in the column but if that's what you're trying to do then that's what we would use the value counts function for now in case that doesn't quite make sense let's look at an example to see what this looks like so for example we had the question on the survey that asked each person whether they coded in their free time as a hobby so to see all of these responses for that question we can look at the hobbyist column so I'll just access that hobbyist column here and run that and we can see that we get a series returned here and these are just a bunch of yes-or-no questions so it was just a yes-or-no question that each person answered so you might get the survey results back and you might think to yourself okay well I can see the responses here in the survey but I just want to know how many people answered yes and no so how would we do that well we can get that information with the value counts function so if I just look at the value counts and that is value underscore counts if we run that method on that series then that is going to give us a breakdown of how many people answered yes and how many people answered no as to whether or not they code as a hobby so I use the value counts all the time when exploring data and we can find out some interesting things from our survey by using this on some different fields so for example there is a question on this survey that asks each person what social media platform they use the most so if you're building an app or a website and want to keep track of the most popular social media sites then you might be interested in what the most popular answers to that question are so to view these results we can access the social media column of the survey so let me do that and before I run value counts on this let me just show you what this column looks like so this column was called social media so I'm gonna run this and we can see that respondent number one said that they use Twitter more than any other social media this person used Instagram reddit Facebook YouTube and so on now I've pointed this out in previous videos so far but if you've forgotten or if this is your first video that you've watched in this series then at the top of my notebook here I've also loaded in a schema data frame right here and this data frame tells us the exact question that was asked on the survey for each of these column names so for example if we want to see the exact question that was asked for this social media column then I can just access that schema data frame and do a dot Lok because the indexes are going to be the column names and then we can just search for social media and if I run that then we can see that the question that they asked on the survey specifically was what social media site do you use the most okay so we can see that we get a few different responses here but which of these are the most popular so to find that out let's look at the value of this series to see what the most popular social media sites are overall for these developers so I'm going to run this and then that value counts function here and now we can see here at the top that reddit was the most popular with about 14,000 people and then we have YouTube whatsapp Facebook Twitter Instagram I don't use social media was one of the answers now we also have some foreign social networks here so I've never heard of these but I believe these are Chinese characters so this is probably a Chinese social media site I don't know really Russian writing but I would assume that this is Russian writing here so this is probably a Russian social media site so it's kind of interesting seeing all of these different answers from around the world now one more quick tip if we want to see these broken down by percentage instead of raw numbers then we can pass in the normalized argument to the value counts function and set that equal to true so let me show you what this looks like so I can say normalize equals true and now we're going to get these broken down by percentage so 17% of the people said that they use reddit 16 said YouTube about 16 said whatsapp and so on okay so

### Grouping [12:30]

we can see that we have some social media sites here from some other countries so obviously this is most likely a regional thing my guess would be that the popularity of these social media platforms varies a lot based on what country you're in so how would we break up these results so that we can see the most popular social media sites for each country now in order to do this we're going to have to learn about grouping our data so again this is a topic that can be a little confusing when you first see it so let me start off slow so that we can see exactly what's going on here so first of all if we want to see specific results based on the country or based on some other column then we're going to have to group on that specific column and we have the group by function for this so what actually does it mean to say that we're going to use the group by function so the pandas documentation it says that a group by operation involves some combination of splitting the object applying a function and combining the results so I'm gonna try to walk through each of those processes one at a time so that we can see exactly how this works so again in the pandas documentation it says that a group by operation involves some combination of splitting up our object applying a function and then combining those results so let's do each of those now first just for a reference let's display the value counts for each country so that we can see the countries that have the most results for this particular survey so to do this we can just access the country column and if I run this we can see that this gives us the country that each respondent said that they were from and if we look at the value counts for this then this is going to tally up all of the unique responses so we can see that the majority of this survey was answered by developers in the United States and in second was India then Germany United Kingdom Canada and so on okay so now let's look at how to use the group by function on our country column so first we're going to split the object and then we're going to apply a function and then it will combine those results so first let's look at splitting the object now in this case we want to group all over the results by country so to do this we can simply say D F dot group by and then we will pass in this is going to be a list of columns that we want to group one and I'm just going to pass in a single column here for country so if I run this then what we get back here is this data frame group by object so what is this object and what exactly can we do with this so first let's explain a bit what this is so this object contains a bunch of groups and to better understand what this is let's take a look at an individual group that this data frame has now before we do that I'm going to set this as a variable so that we can reuse this and not have to retype our code over and also it'll be easier to read so I'm going to call this country group and I'm just going to set this equal to this DF group by and now instead of typing this every time we can just reference this country group variable here so now let's take a look at one of these groups so since we grouped our rows by country then we can grab a specific group by country name so I'll grab the group for the United States so to do this we can say country group dot get underscore group and then pass in the name of the group in this case I want to get the group for United States so if I run this cell whoops and this is telling me that country group is not defined and it's because I didn't rerun this cell up here after I set that variable so if I run this and grab the group for the United States then we can see that we get a data frame returned here with some survey results so this doesn't look like anything special yet but if I look at the country name for each of these survey results the country is listed right here then we can see that all of these responses are from people who said that they were from the United States and if I look at the group for India so if I instead change United States to India here and grab that group if we look at the country here then these are all the survey results for people who said that they were from India so that's what our data frame group by object that we saw before consists of it has broken up all of the different responses into groups by country name so this would be similar to running a filter on our original data frame so I should be able to get these same results for a single country just by doing what we've seen in previous videos and creating a filter so I could say okay I want to grab I want our filter to be equal to any time the country is equal to the United States and then I can apply this to our data frame by saying okay D F dot Lok and give me all the results that match that filter and if I run this cell then we can see over here in the country column that all of these results are respondents from the United States so if we're just looking to get information on a single country then it's very similar to just creating a filter like we did here but instead of just grabbing the results for one country group by instead splits all of these responses up by country name so now that we have all of those split up and grouped by country name now we can apply a function and bring those results back together so what kind of function would we like to apply well like I mentioned before maybe we want to see the most popular social media sites broken down by country now if you just wanted to get the most popular social media sites by the United States or by India then we've already seen how we can do this so right here I have some filtered results down to where we have the responses for the United States so we can just do what we did before where we ran the value counts method on the social media column so I could just say here at the end I could access that social media column of that filtered data frame and then I could just run value counts here so if I run this then we can see that for the United States we have Reddit and Twitter and Facebook and YouTube as the top four social media sites and if we wanted to look at these specifically for India then I could instead change that filter for India and run this and we can see that whatsapp came first and then YouTube then LinkedIn and then Facebook so these are the results for one specific country but if we were to run this on our data frame group by object then it will give us the results for all of those country groups so if it helps you with how you think about this you can imagine that it's similar to running a filter and then applying a function like we did here with a single country but when we group these using the group by function and then apply a function then it will combine those groups to give us the results for all of those unique countries so I think this will make sense once we just see this here so remember I called our group up here country group so if we come down here to the bottom then we can say okay for the country group now I want to look at the social media column and I want to grab the value counts for that column for that entire country group so if I run this then what this returns is a series with the most popular social media sites broken down by country now this actually cuts off a little early here so let me grab a larger chunk of this series to get a better idea of what this looks like so right here at the end I'm just gonna say dot head and look at the top 50 results or so if we run this then we can see here that our first country is Afghanistan and we can look at the most popular social media for that and then go down the list Albania Algeria Argentina and so on now this is actually returning a series and this series has multiple indexes it has this country index and the social media index now we haven't discussed multiple indexes in this series yet but if anyone is curious about how this works then maybe just leave a comment in the description section below and maybe we can cover that topic in a future video but the country is the first index and we can grab these just like we would with any other series so again if I wanted to grab those most popular social media sites for India for example then I could just come up here and with that returned series actually let's take a look at this again so here's the index here I can grab that series just by saying dot lok and then looking for India and we can see that those are the same results that we got before now you might be wondering well hey if those are the same results that I got before then why is this even useful and it's useful because now we can see this result with any country without running a filter on each individual country in the world so for example if I wanted to see the most popular social media sites for the United States then now instead of you know changing a filter over and over I could just you know go here and look at the United States index for this return series and now we can see those results so I think it's really interesting being able to play around with your data like this and being able to explore I really like seeing the different results for different countries and a lot of these sites I've never heard of so for example if we look at the most popular social media sites and China or in Russia then let me look at China here so we can see that yeah it does look like that was a Chinese social media site this way chat or WeChat and then we have I'm assuming this is pronounced Weibo maybe but yeah I think that's very interesting if we want to look at Russia then we can't actually say just Russia in this survey Russia was called the Russian Federation I've made that mistake before where I just type in Russia and it'll tell you that it cannot find and index with that name so this is actually Russian Federation and if we search for that then we can see I don't know how to pronounce this but the one that I thought was Russian writing before it does look like that was in fact Russian and just remember if it makes more sense for you to look at percentages instead of just raw numbers here then you can always set normalize equal to true and it will give you percentage results instead of the raw numbers so we can see that this Russian social media site here has 30 percent or 30 percent of the people from Russia said that was their most popular social network and if we go back to China then we can see that this one here at the top that has 67 percent of the developers from China said that was the social media site that they used the most so I just thought that was really interesting being able to play around with these numbers and seeing the different results for different countries and this is the kind of thing that we can do once we got these skills down within pandas and a lot of the times it's just fun being able to explore your data like this and finding things within your data that you might not have expected now bringing this back to what we were discussing at the beginning of the video we can also use this to run more traditional aggregate functions like mean median and things like that so before we looked at the median salaries for the entire survey but now let's break these down by country instead so just like we looked at the value counts of the social media column we can look at the median of the salary column and that salary column is labeled converted comp so to do this I can just grab our country group here and we want to look at this converted column and now we need to tell it what aggregate function we want to see for all these countries and I want to see the median salaries for all these countries so if I run this then we can see that our result here is that it says okay here is the median salary in Afghanistan here it is for Albania and so on so now if you wanted to for example see the median salary in a place like Germany then we can just simply come up here and this is the result that we get here and these are our indexes so the index the indexes are country name so if I want to grab a specific country then I can just use dot loke and type in the country name so if I run this then we can see that the median salary here in Germany is about 63,000 now maybe

### Multiple Aggregates on Group [26:00]

you're working on some analysis where you want to group your data but you also want to run multiple aggregate functions on your group so let's say that we just didn't want to see the median but we also wanted to see the mean as well so to do this we can use the AGG method AGG and pass in all of the aggregate functions that we want to use so to do this here I could just say let me grab where we ran our median here instead of running just the median aggregate function we're going to use this AGG method here AGG and now we're going to pass in a list of the aggregate functions so let's say that I want to get the median first and then I also want to be able to see the mean so if we run this then we can see that we get a data frame with the mean and the median salaries for every country and again just like we did before if I wanted to narrow this down by a specific country then we could easily do that just by grabbing one of these indexes here by country name so if we wanted to look at the mean and median salaries for Canada then I could just come up here and say dot Lok and then pass in Canada here let me spell that correctly and now we can see the median salary and the mean salary for Canada

### People Who Know Python By Country [27:20]

now depending on what you're trying to do you might run into some issues that you didn't quite expect so for example let's say that you're trying to figure out how many people in each country know how to use Python so before we do this to our group let's first look at how we do this with a single country using the filtering approach that we used earlier so I'm going to scroll up to where we had that filter and I'm going to copy that and paste that in down here and then I'm just going to get rid of this value counts section here so currently the filter that we have here is we are filtering the countries down to people who said that they were from India so now in order to figure out how many people said that they knew Python within this survey we're going to use the string methods that we've seen in previous videos and if you don't remember what these look like then we could do this by doing something like this we could say okay I want all of the responses for the people who said that they were from India and now when I get that result remember that this result here is just going to be a filtered version of our data frame our original data frame and now we can say okay I also want the language worked with is where they put the differences the different languages that they actually use so if we look at this language worked with column here then we can see that they list all of the languages that they said that they know and to see if Python is within this column here then I can say dot STR and use the string class on that return series and say okay we want where the STR dot contains Python so this will return true for the rows that have Python and the languages worked with and false for the responses that don't so if I run this then this just returns a series of true and false values where it tells us whether the language worked with column for each respondent contained that string of Python now if we want to actually count the number of people who know Python then we can use the sum function to add all of these up now normally you might think that some would only work with numerical data but some will also work on boolean x' it will count all of the truths as one and all the falses as zero so to find out how many people know Python then I could simply just do a dot sum here at the end and if I run this then we can see that around 30 100 people from India who answered the survey said that they knew Python as one of the languages that they work with now before when we wanted to run a similar aggregation function on our data frame group by object we simply took the same approach on our group by object so for example you might think that we could just do something like this to see all of these to see how many people knew Python from each country you might think that we could say okay well I should just be able to do this I could just say okay for this country group I want to look at this language worked with column and then see the strings that contain Python and sum those up but if I run this here then we can see that we get an error now like I said in a previous video sometimes it can be hard to read these pandas errors and understand exactly what we did wrong but in this case it actually gives us a pretty good clue as to what we did wrong it tells us that we cannot access the attribute string of a series group by object and then it says try using the apply method instead so the reason that we this error here is because this is no longer just a series instead this is a series group by object and it tells us to instead it used the apply method so when we run an apply method on a group object like this we're going to specify a function that we want to be run on every series in this group and I know that can sound a little bit confusing so let's actually see what this looks like and hopefully it'll clear this up a bit so instead of accessing this string class directly here I'm instead going to use the apply method and for anybody following along or who will download this I'm gonna go ahead and leave this cell with this error here so that you can run that and reproduce that error and then I'm gonna do the correct way in this cell so again instead of using it the string class directly on this series group object I'm instead going to use the apply method so let me just cut that out and I'll say dot apply and now we can apply a function that we want to run on each series in this group so if you've seen one of the previous videos then you'll know that if we just want a nice quick easy function then we can use a lambda function you could write another separate function if you wanted to but here I'm gonna use lambda so lambda here is going to be a series so now we can say okay well what do we want to return all right well I want to return X and then since this is a series we can say X dot string dot contains Python dot sum so again just one more time we are running the apply method on this series group and then we are passing in a function that is going to run on each one of these series and the function that we want or what we want returned from that function is the sum of any of the values in that series that contain the string Python and it's going to do that for every country since we're using this country group so if I run this then we can see here that we see okay in Afghanistan eight of the respondents said that they know Python Albania was twenty three and so on now seeing these numbers by itself isn't really that big of a help if we're trying to get an understanding of the percentage of people in each country who said that they know Python because with these results here we only see a single number we'd have to go back and forth and compare okay how many people answer the survey from each country and how many of them use Python and then we could do a calculation from there to figure out the percentage of people from that country who knew Python but we don't want to do that is too much to do manually so we want to figure our way so that we can get Python and pandas to do this calculation for us now a lot of people have asked me to put together coding problems to practice what we learn in these videos so you can think of this as practice so I'll do this here

### Practice Question [34:20]

so can any of you think of a way where we can figure out what percentage of people in each country know how to use Python if you think that you can figure that out then you can pause the video here and try to work through this yourself and it's going to combine a few topics that we've discussed in the series so far in order to do this but with that said I'm going to go ahead and move along with my solution so again if you want to try to figure that out on your own then you can pause the video and try to work that out and if you did do that then I hope that you were able to get something figured out there but if not then no worries let's go ahead and walk through my solution here so that you can use this as practice to get better with pandas so that you can do this type of analysis in the future so like I said in order to get the percentage of developers who know Python for each country we're going to use a combination of a few different things that we've learned throughout this series so far now there are probably several different ways of answering this question and if you have a different way that you answered this question than me then definitely leave it in the description section below so that people can see different approaches to this you know it's absolutely possible that there's a more efficient way than how I'm about to do it here so if there is then I'll highlight that so others can see what the best approach is but here's how I'm gonna do this so first I'm gonna grab the total number of respondents from each country that way we know the total number of people from each country who responded to this survey so I will just call this country respondents and I will set this equal to we want to grab the value counts of the countries here so if I print out what we get here we've seen this before whoops and I got an error there because I put County I meant to put country so if I look at this then these are the total number of respondents who said that they were from each country and again we saw this earlier in the video so now I'm gonna grab the total number of people from each country who know Python and we just did this a second ago right here but I'll go ahead and do this again and set it as a variable so that we have all of these steps so I'm going to grab all of that we just calculated and now I'm going to set this as a variable and I'm gonna call this you know country uses Python and then I'll set it equal to that and now let's print out that variable as well so let me go to the next line here my computer's kind of given me some grief okay so these are all the people from each country who said that they know how to use Python so now we have one variable that is a series that has the total number of people from each country right here called country respondents and then we have another variable that is a series that is the total number of people from each country who know Python

### Concat Series [37:27]

so now we need to combine these two now I'm actually going to use a method here that we haven't discussed in this series yet so if you got stuck here then that's completely understandable I probably should have mentioned this in the video where we appended rows to a data frame but we can combine more than one series together using the pandas concat phone so let's see what this would look like so I can say and I'll just call this theta frame Python DF and now I'm going to create a data frame where we can cut those two series in the one so I can say PD dot concat and now I'm going to pass in a list of the series that we want to concatenate so I want this to be our country respondents and I also want to add in this country uses Python series and now we also want to set axis equal to columns because by default it's going to try to concatenate these on row but we want to match up the indexes here so that it can cuts it that way instead so we want to say axis is equal to columns and then finally I'm also going to put sort is equal to false now if you watched a previous video this isn't absolutely necessary but if you run it without sort equal to false then it'll give you a warning saying that in a future version of pandas that it'll sort by default or sort by false on default so it's better just to go ahead and specify if you want the resulting data frame sorted or not so now let's look at this concatenated data frame here okay so now we have a data frame here where these two series have been concatenated and match up on the same index so this is a lot more useful because now we can see okay there were about 20,000 or 21,000 people who said that they were from the United States and about 10,000 people who said that they know Python so that's definitely a lot better and more useful information now one thing about this new data frame that we have is some columns that don't really relate to what we're talking about anymore we can see here that this one is just called country and this one is called languages worked with so let's rename these so that they make more sense in the context of what we're actually trying to do and we saw how to rename columns in a previous video as well but if you forgot then you can do this just by grabbing our data frame here and I'll say Python DF which is our data frame dot rename and now what do we want to rename the columns and now I'm going to pass in a dictionary here where the key is the previous value and the value is going to be the updated value so I will call this number of respondents and then I also want to change this languages worked with column here and I want to change this to B let's call this num nose python and if i run this then we can see that this looks good we have number of respondents from the united states and number nose python from the united states so that looks good to me so since it looks good i'm going to say in place is equal to true so that it actually modifies our data frame so if I run that and then look at our data frame one more time then we can see that it has been updated with those new columns now we have the total number of respondents from each country and the number of people who know python from each country in one data frame so we have all the information that we need to calculate a percentage now all we need to do is create a new column and calculate this so if you remember in order to create a new column we can simply just assign it so I will call this column PCT for percentage knows Python and now what do we want this to be equal to well if you don't know how to calculate a percentage mathematically basically what you do is you take the part and then divide that by the whole and then you multiply that by 100 so our part here is the number of people who know Python so I will grab that and say python underscore DF and access that series access that column and then want to divide that by the hole and the hole are the total number of people from that country so that is Nam respondents and now if we want this to be a whole number percentage then we can multiply this by 100 okay so if I did all of this correctly and it's very possible I made a mistake but if I did all this correctly then we should have a data frame here with the percentage of people who know Python from each country and now we can work with this just like any other data frame so let's say that we wanted to sort these results now we learned this in a previous video on how to sort values in a series so let's say that we want to sort the countries by the largest percentage of respondents who know Python so to do this I can just say Python DF dot sort underscore values and if you forget how to do any of this then you can always go back to our pandas video where we learned about sorting so in order to sort by the people who know Python or the percentage we can say okay sort by what did I call this here percent knows Python and then I actually want this to be in ascending order equal to false because I want the largest percentage of people who know Python at the top and I was about to put in place equals true first but let's see what this looks like okay so it looks like that sort worked and it looks good so now I'll say in place is equal to true so that it modifies our data frame and now we can look at our results here so we can see here that some of these are a little misleading here because you know a hundred percent of people from South Bay and príncipe know Python but we only had one person from the country who answered the survey and he happens to know Python or she so that is a hundred percent so instead let's look at the head here and grab see if we can find a country here with a larger number of respondents so okay we have 72 people from Uganda and 47 of them knew Python so that's 65 percent that's pretty good we have oh okay so this is United States that's not bad either we have about 21,000 here about 10,000 new Python so that's 48 percent so that's in the higher range that's pretty good so yeah I think this is a great way to practice working with pandas and also it's just fun being able to explore your information in this way and now that we have a data frame with all this information then we can also inspect a specific country to see what the percentage of developers are from a specific country who know Python so for example instead of looking through what if I wanted to see Japan instead of looking through all of these I could just say okay Python D F dot Lok and since our country names are our indexes here then we can just do a dot lok of japan and then we can see that we get these statistics for that specific country okay so I know that may have been a lot to take in and that we covered a lot of ground in this video we definitely covered some more advanced topics here than we did in previous videos but I hope this kind of got you a little excited to learn what you can do with pandas and the types of problems that we can solve you know when you are exploring through your data like this you're probably going to make a ton of mistakes along the way you know I still make mistakes and pandas all the time even in these videos I've made some mistakes and I have these scripted out so it definitely happens but you know each problem that we work through it's similar to this just makes it easier and easier each time to work through additional problems so if you need to go back and re-watch some of these steps in order to work through these problems like this on your own then that's completely normal you know don't think that just because this may have seemed difficult that there's something wrong with you it's definitely normal for this stuff to be a lot of information to take in and also like I said before if you have some other ways of solving the problems that we answered here then like I said definitely leave a comment with your solution in the description section below and I'll take a look at those and I'll highlight some if they are better than what I did here okay so before we end here I would like to mention the sponsor of this video and that is brilliant so in this series we've been learning about pandas and how to analyze data and Python and brilliant would be an excellent way to supplement what you learn here with their hands-on courses they have some excellent courses and lessons that do a deep dive on how to think about and analyze data correctly for data analysis fundamentals I would really recommend checking out their statistics course which shows you how to analyze graphs and determine significance in the data and I would also recommend their machine learning course which takes data analysis to a new level while you'll learn about the techniques being used that allow machines to make decisions where there's just too many variables for a human to consider so to support my channel and learn more about brilliant you can go to brilliant org forge slash CMS to sign up for free and also the first 200 people they go to that link will get 20% off the annual premium subscription and you can find that link in the description section below again that's brilliant dot org forge slash CMS okay so I think that's going to do it for this pandas video I hope you feel like you got a good idea for how to use these aggregate functions and also how we can group our data so that we can explore our data in interesting ways I would really encourage you to take some time after this video and play around with the data a bit see if you can answer certain questions that someone might have about this data so for example what is the most common education level for people who answered the survey that's definitely something that we could answer by what we learned here so I hope you feel like you got a good introduction to being able to answer those types of questions now in the next video we're going to be learning about how to handle missing data and how to clean up your data it's very common for data to have missing values so knowing how to sanitize and clean our data is definitely going to be important but if anyone has any questions about what we covered in this video then feel free to ask in the comment section below and I'll do my best to answer those and if you enjoy these tutorials and would like to support them then there are several ways you can do that the easiest ways to simply like the video and give it a thumbs up and also it's a huge help to share these videos with anyone that you think would find them useful and if you have the means you can contribute through patreon and there's a link to that page into the scription section below be sure to subscribe for future videos and thank you all for watching you
