# Matplotlib Tutorial (Part 6): Histograms

## Metadata

- **Channel:** Corey Schafer
- **YouTube:** https://www.youtube.com/watch?v=XDv6T4a0RNc
- **Date:** 15.06.2019
- **Duration:** 16:36
- **Views:** 209,016
- **Source:** https://ekstraktznaniy.ru/video/11830

## Description

In this video, we will be learning how to create histograms in Matplotlib.

This video is sponsored by Brilliant. Go to https://brilliant.org/cms to sign up for free. Be one of the first 200 people to sign up with this link and get 20% off your premium subscription.

In this Python Programming video, we will be learning how to create histograms in Matplotlib. Histograms are great for breaking your data into bins and seeing where your data falls based on those bins. Let's get started...

The code from this video (with added logging) can be found at:
http://bit.ly/Matplotlib-06

✅ Support My Channel Through Patreon:
https://www.patreon.com/coreyms

✅ Become a Channel Member:
https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g/join

✅ One-Time Contribution Through PayPal:
https://goo.gl/649HFY

✅ Cryptocurrency Donations:
Bitcoin Wallet - 3MPH8oY2EAgbLVy7RBMinwcBntggi7qeG3
Ethereum Wallet - 0x151649418616068fB46C3598083817101d3bCD33
Litecoin Wallet - MPvEBY5fxGkmPQgocfJbxP6EmTo5UUXMo

## Transcript

### Introduction [0:00]

Hey there, how's it going everybody? In this video we're going to be learning how to plot histograms. Histograms are great for visualizing the distribution of data, that is, where the data falls within certain boundaries. It's a lot like a bar graph, but a histogram groups the data up into bins instead of plotting each individual value. The best way to see what this looks like is to just take a look at some examples.

Now, I would like to mention that we do have a sponsor for this series of videos, and that is brilliant.org. I really want to thank Brilliant for sponsoring this series, and it would be great if you all could check them out using the link in the description section below and support the sponsors. I'll talk more about their services in just a bit.

With that said, let's go ahead and get started. I have a little starting code here that you might recognize if you're continuing from previous videos, but if you're not, then let me give a quick overview of the code and what's going on. Up here at the top of the code I am importing pandas, and I'm also importing pyplot from matplotlib. I'm using the fivethirtyeight style just to make our plots look a little nicer. Here is the data that I'm going to be using for this video: right now I just have a list of ages between 18 and 55. Here is some data from a CSV file that I have commented out, and we'll look at this once we get further along in the video and see how to plot more data than just this small list.

Down here at the bottom we are also creating a title for our plot, we have x-axis and y-axis labels, we have a tight layout, which just gives our plot some padding, and we are calling plt.show, which will actually show our plot. As usual, if you'd like to follow along, I will have this code available on my GitHub, and there's a link to that in the description section below if you want to copy and paste this into your editor so that you can follow along with this exact data. I'm also going to have the data CSV file that I'm using in this video as well.

Like I was saying, we're first going to look at how to do this using this list of data directly here in the Python script, and then we'll look at a real-world example with data that I'll load in from a CSV file. So first, let's
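For readers following along without the linked repository, here is a minimal sketch of the starter script described above. The transcript never lists the ages or the title and label text, so those are assumptions: the list below was chosen to span 18 to 55 and to reproduce the bin counts quoted later in the video. The pandas import is left out until the CSV is actually used, and the `Agg` backend line is an addition so that the sketch runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove this line to get a window
from matplotlib import pyplot as plt

# The style named in the video; it only changes how the plot looks.
plt.style.use("fivethirtyeight")

# Assumed sample data (not given in the transcript): ages between 18 and 55.
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# The histogram call is added in the next steps.

# Hypothetical title and axis labels; the transcript does not quote the exact text.
plt.title("Ages of Respondents")
plt.xlabel("Ages")
plt.ylabel("Total Respondents")

plt.tight_layout()  # adds some padding around the plot

# In an interactive session, finish with plt.show() to display the figure.
```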

### Histograms [2:05]

look at this small list of sample data. Let's pretend that we took a survey and we tracked the ages of all the people who responded. It might be useful to plot those ages to get an idea of which age groups are in our sample. So how should we actually plot these? Off the top of your head, you might think that a bar chart would be a good idea for this, but if you think about it, we could have up to a hundred different possible ages, maybe even more. If you plotted how many responses we got from each age, you'd have almost a hundred different columns, which definitely isn't useful. This is where histograms come in. Histograms allow us to create bins for our data and plot how many values fall into those bins. To see this, let's create a histogram of this list of ages that we have here. To do this I can simply say plt.hist and we will plot

### Bins [3:00]

out those ages. If I ran this now, it would give us a plot, but we wouldn't really know what bins it's actually using, so I always like to pass those in manually and explicitly so that people know what those bins are. When we specify bins, we can either pass in an integer or a list of values. If we pass in an integer, it will just make that number of bins and divide our data into those accordingly. For example, if I was to say bins is equal to 5, then this will divide all of these ages up into 5 different bins and tell us how many people fell into those age ranges. If we run this, then we can see
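The integer form of `bins` described in this section can be sketched as follows. The `ages` list is an assumed stand-in (the transcript does not list the values), and the `Agg` backend line is only there so the sketch runs without a display. `plt.hist` returns the per-bin counts and the bin edges it computed, which is a handy way to see which bins were used.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
from matplotlib import pyplot as plt

# Assumed sample list (not given in the transcript).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# An integer asks for that many equal-width bins between min(ages) and max(ages).
counts, edges, patches = plt.hist(ages, bins=5)

print(counts)  # how many ages landed in each of the 5 bins
print(edges)   # the 6 boundaries matplotlib computed for those bins
```

With this stand-in list the counts come out as 4, 4, 1, 1, 1, which matches the shape of the plot described in the next section.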

### Histogram [3:45]

that we get a pretty simple histogram here, and what this shows is a distribution. Now, I personally find it a bit difficult to read these sometimes if we don't have edge colors for each bin, because they all just kind of run together. I don't know exactly how many bins there are here and here. I'm guessing, since we have five bins, it's two bins here and three bins here, but let's add in some edge color so that's more clear. We can add those in by going back to our plot, and I'm just going to pass that as an argument: edgecolor is equal to, I'll just say, black. Now let's run this, and now we can see those bins a bit more clearly. Let me make this a little larger, to where you can see the ages up here at the top, and let me explain what this is actually doing.

We said that we wanted our data plotted on a histogram and we wanted that broken up into five different bins, so it calculated those age ranges for us. This looks like it's between, let's see, 18 and like 26 maybe, and then 26 to 33, and so on. What this is telling us is that there are four people in our ages list that fall between 18 and 26, and there are four people that fall between 26 and 33, and so on, and then we just have one person in each of the bins for these higher age ranges.

So if you pass in an integer for the bins, that's what we get. But we can also pass in our own list of values, and those values will be the bin edges. I like passing in a list of bins for this kind of data because you have more control over the exact values. For example, let's say that I wanted to plot the ages broken up into ten-year groups. Right here above my plot, I'm going to say bins is equal to, and I'm just going to say that we want bins for 10, 20, 30, 40, 50, and 60. And now instead of passing in that we
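The "18 to about 26, then 26 to about 33" ranges read off the plot follow from simple arithmetic: with an integer `bins`, the range from the smallest to the largest value is split into equal widths. Below is a small sketch of that calculation together with the `edgecolor` argument. The `ages` list is an assumed stand-in, `np.histogram_bin_edges` is used only as a cross-check, and the `Agg` line is an addition for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
import numpy as np
from matplotlib import pyplot as plt

# Assumed sample list (not given in the transcript).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]
n_bins = 5

# Equal-width bins: (55 - 18) / 5 = 7.4 years per bin.
width = (max(ages) - min(ages)) / n_bins
manual_edges = [min(ages) + i * width for i in range(n_bins + 1)]

# NumPy computes the same edges for an integer bin count.
numpy_edges = np.histogram_bin_edges(ages, bins=n_bins)

# edgecolor draws an outline around every bar so neighbouring bins stay distinguishable.
counts, edges, patches = plt.hist(ages, bins=n_bins, edgecolor="black")

print(manual_edges)  # approximately 18, 25.4, 32.8, 40.2, 47.6, 55
```

Edges of roughly 25.4 and 32.8 are why the boundaries on the plot look like "about 26" and "about 33".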

### Using bins [5:45]

want 5 bins, I want to say that I want to use this list as my bins. Now if I run this, we can see that we still get five different bins here, but that's only because we have six values in our list: it starts at 10, and then it goes 10 to 20, 20 to 30, 30 to 40, 40 to 50, and 50 to 60, so that is five bins total. If I open this back up, the reason I like using my own bins for this kind of data is that now it doesn't have to try to guess where I want these broken up. We can see that now we have 10 to 20; it's a lot easier to read, and we don't have to guess whether it's like 26 or something like that. So from 10 to 20, two people in our ages list fell into that bin. There were four people from 20 to 30, three from 30 to 40, one from 40 to 50, and one from 50 to 60. So that's how you plot and read a histogram. We can even exclude some data if we don't want to add those ranges to our bins. For example, let's say that we didn't want to include the ages between 10 and 20 in my
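A sketch of the list form of `bins` described above: six edges produce five bins. The `ages` list is again an assumed stand-in, chosen so that the counts match the ones read out in the video (2, 4, 3, 1, 1), and the `Agg` line is only for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
from matplotlib import pyplot as plt

# Assumed sample list (not given in the transcript).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# Six edges define five bins: 10-20, 20-30, 30-40, 40-50, 50-60.
bins = [10, 20, 30, 40, 50, 60]

counts, edges, patches = plt.hist(ages, bins=bins, edgecolor="black")

print(len(bins) - 1)  # number of bins
print(counts)         # people per ten-year group
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. An age of exactly 30 is therefore counted in 30 to 40, not in 20 to 30.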

### Removing values from bins [7:00]

results. To do that, we can simply remove 10 from the bins, and now 20 will be the leftmost value. If we run this, we can see that it's not even plotting the ages from 10 to 20, so this 18 and 19 don't even show up in our results here. So this is now just giving us our results for the people who fell into these age ranges between 20 and 60.

Okay, so now that we've looked at this small example, let's look at a real-world example using some real data. Let me uncomment what I've got here. Let me remove ages here, so I'm just going to remove that data that is directly in our Python script. Now I'm going to uncomment the data that I had down here; let me cut that out and paste it here above our bins and our
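Dropping the first edge, as described above, means values below the new leftmost edge are simply not counted. A minimal sketch with the same assumed `ages` list as before (not given in the transcript); the `Agg` line is only for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
from matplotlib import pyplot as plt

# Assumed sample list (not given in the transcript).
ages = [18, 19, 21, 25, 26, 26, 30, 32, 38, 45, 55]

# With 10 removed, the leftmost edge is 20, so ages below 20 fall outside every bin.
bins = [20, 30, 40, 50, 60]

counts, edges, patches = plt.hist(ages, bins=bins, edgecolor="black")

left_out = [age for age in ages if age < bins[0]]
print(left_out)      # the ages that no longer appear in the histogram
print(counts.sum())  # only the ages inside 20-60 are counted
```

No warning is raised for the excluded values, so it can be worth comparing `counts.sum()` with `len(ages)` when the bins are set by hand.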

### Loading a CSV file [7:55]

plot. Okay, so I'm loading in this data CSV file, and I'm using this pandas.read_csv method to do this. We've done this a few times so far in the series, but if this is the first video that you are watching in the series, then let me explain this really quick. We are loading in this data CSV, so what this does is it goes to this data CSV file here. Let me explain what this survey data is. We have these responder IDs, and this is just an ID for each person who responded to the survey, so this is one person here, this is another person here. Then our age column is just the age of each person who responded to this survey, so this person was 14, this person was 19, then 28, 22, and so on. So back here we have our IDs

### Setting IDs [8:40]

variable, and we're setting that equal to data, and then we are passing in this responder ID key. What that does is it sets those IDs equal to all of the IDs that are in this responder ID column. And here we're saying ages is equal to data age, and that is setting that ages variable equal to this entire column for our ages. The data that I'm using here are the responses from the 2019 Stack Overflow Developer Survey, so this is actually real data from people who answered that survey. We have, let's see, over 79,000 responses here in this data CSV file.

Okay, so let's plot a histogram of the ages for this dataset and see what age ranges most people who answered this survey fall into. I'm going to expand the bins here a bit, and I'm going to say 10, 20, 30, 40, 50, 60, and we'll also cover 70, 80, 90, and let's also put in a hundred there. Since we called this ages variable the same thing that we had before, we don't even need to change our histogram plot, because that is still just ages there. So now I should be able to run this and get some results from that data.

We can see, based on this plot, that almost 40,000 of the respondents were between the ages of 20 and 30, and almost 25,000 were between the ages of 30 and 40. Now, it might not look like we have data for 70 to 80 and 80 to 90, but it's likely because there just weren't many responses with those ages, and compared to 40,000 responses for the 20 to 30 group, it's just too small to show up. But I bet if I was to zoom in on these values here, then we will start to see something. Okay, so here's 70 to 80. If I zoom in here, then we can see 80 to 90, so there are some responses there, but they're just being dwarfed by these numbers over here. Now, when you have certain values that are a lot larger than your other values, you can plot this on a logarithmic scale so that it doesn't look so extreme. To do this we can add an argument of log equals
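The CSV loading step can be sketched as below. The real survey file is not reproduced here, so the sketch reads a four-row stand-in from a string. The ages 14, 19, 28, and 22 are the ones read out in the video, while the IDs and the column names `Responder_id` and `Age` are assumptions, since the transcript only calls them the responder ID and age columns. With the actual file, the call would be `pd.read_csv("data.csv")` (file name as spoken in the video).

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
import pandas as pd
from matplotlib import pyplot as plt

# Stand-in for the survey CSV; IDs and column names are hypothetical.
csv_text = """Responder_id,Age
1,14
2,19
3,28
4,22
"""

data = pd.read_csv(io.StringIO(csv_text))
ids = data["Responder_id"]  # one ID per respondent
ages = data["Age"]          # the whole age column as a pandas Series

# Ten-year bins from 10 up to 100, as in the video.
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# A pandas Series can be passed to plt.hist directly.
counts, edges, patches = plt.hist(ages, bins=bins, edgecolor="black")
print(counts)
```

Because the variable is still called `ages`, the `plt.hist` call is the same as for the hand-written list; only the bins were widened.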

### Plot on logarithmic scale [11:00]

True to our plot. So within our hist method I'm just going to say log is equal to True, and now if I run this, this is plotted on a logarithmic scale, and we can see that now we do have that data visible for 70 to 80, 80 to 90, and 90 to 100. We actually had more people who responded to the survey that they were between the ages of 90 and 100 than people who were between 80 and 90, so I think that's kind of interesting.

Now, sometimes you might find it useful to add some additional information within these plots as well. For example, let's just leave the histogram how we have it for now, but let's say that we want to plot a vertical line where the median age of all the respondents is. I've got this commented out down here at the bottom
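A sketch of the `log=True` argument on deliberately lopsided data. The list below is synthetic and made up for illustration; it is not the survey data and only imitates the situation where a few bins dwarf the rest. The `Agg` line is only for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
from matplotlib import pyplot as plt

# Synthetic, made-up ages: two huge groups and a few very small ones.
ages = (
    [15] * 50 + [25] * 1000 + [35] * 600 + [45] * 200 + [55] * 60
    + [65] * 10 + [75] * 3 + [85] * 1 + [95] * 2
)

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# log=True puts the y-axis on a logarithmic scale; the counts themselves do not change.
counts, edges, patches = plt.hist(ages, bins=bins, edgecolor="black", log=True)

print(counts)                  # same counts as without log=True
print(plt.gca().get_yscale())  # the axis scale after the call
```

Only the axis changes, so bars with one or two respondents become visible next to bars with a thousand.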

### Plot a vertical line [11:55]

here, so let me uncomment this median age, and I'm also going to uncomment this color and this legend as well. I went through and calculated the median age of all of the respondents, and it was 29 years old. So now let's plot a vertical line on our existing plot at that age. To do that, just above our legend here, I'm going to say plt.axvline. I'm pretty sure that stands for axis vertical line. We want that line to be plotted at the median age, and I also want to add in a color here. The custom color I'm going to add is this one; I think this is just a red color that I grabbed. Let's also put in a label so that we know what this line represents, and I'm just going to say Age Median. Now let's run this, and we can see that within our histogram we now have this vertical line, which is the age median.

So this plot tells us a lot of things. It tells us how many people who answered the survey fall within which age groups, and also where the median is for those survey results. If you think that this line is a little bit thick and kind of obstructing the data, then you can play around with how this looks. For example, if you wanted to change the thickness, we could say linewidth is equal to 2. If I run that, then that's a little thinner. So that's basically what these histogram plots are used for: we can use these for dropping our data into different bins and seeing how many values fall into those bins. So that's what you would use a histogram for. Okay, so we are just about finished
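The median line can be sketched as follows. The median of 29 is the value quoted in the video. The short `ages` list is a made-up stand-in whose median also happens to be 29, and the hex color is an assumption, since the transcript only says it is a red. The `Agg` line is only for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove to use plt.show() interactively
from matplotlib import pyplot as plt

# Made-up stand-in data, not the survey responses.
ages = [22, 25, 27, 29, 31, 36, 44]
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

plt.hist(ages, bins=bins, edgecolor="black")

median_age = 29    # value quoted in the video for the survey data
color = "#fc4f30"  # assumed hex value for the "red" mentioned in the video

# axvline draws a vertical line across the axes at x=median_age.
line = plt.axvline(median_age, color=color, label="Age Median", linewidth=2)

# The legend picks up the label given to the line.
plt.legend()
```

Instead of hard-coding the median, it could be computed from the loaded column, for example with the pandas `Series.median()` method.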

### Outro [13:40]

up here, but before we end I'd like to mention the sponsor of this video, and that is brilliant.org. Brilliant is a problem-solving website that helps you understand underlying concepts by actively working through guided lessons. They have computer science courses ranging from algorithms and data structures to machine learning and neural networks. They even have a coding environment built into their website so that you can run code directly in the browser, and that's a great way to complement watching my tutorials, because you can apply what you've learned in their active problem-solving environment, and that helps to solidify that knowledge. Their guided lessons will challenge you, but you also have the ability to get hints or even solutions if you need them. It's really tailored towards understanding the material. Their computer science material is fantastic, and I really like what they're doing. They also have plenty of courses depending on what you're most interested in, so they have courses in different fields of mathematics, astronomy, solar energy, computational biology, and all kinds of other great content. So to support my channel and learn more about Brilliant, you can go to brilliant.org/cms to sign up for free, and the first 200 people that go to that link will get 20% off the annual premium subscription. You can find that link in the description section below, and again, that's brilliant.org/cms.

Okay, so I think that is going to do it for this video. I hope you feel like you got a good understanding of how to use histograms and also when they might be appropriate for different kinds of datasets. These are definitely nice when we have data like we did in this video, where we want to divide those ages up into different bins and get an idea of the age distribution. Like I was saying before, you might be tempted to use a bar plot, but when you have a hundred ages like this, that means we're going to have a hundred little bars, and sometimes that just doesn't tell you the information that you're looking for; these histograms are better suited for that.

In the next video we're going to be learning about scatter plots. Scatter plots are great when we want to show the relationship between two sets of values and see how they're correlated. For example, let's say that we wanted to see how salaries were correlated with age or something like that. We would probably assume that on average we'd see higher salaries with higher ages, but to be sure we can plot that with a scatter plot and see what that data looks like. So definitely be sure to check out that video.

If anyone has any questions about what we covered in this video, feel free to ask in the comment section below and I'll do my best to answer those. If you enjoy these tutorials and would like to support them, there are several ways you can do that. The easiest way is to simply like the video and give it a thumbs up. It's also a huge help to share these videos with anyone who you think would find them useful. And if you have the means, you can contribute through Patreon, and there's a link to that page in the description section below. Be sure to subscribe for future videos, and thank you all for watching.
