Python Pandas Tutorial (Part 9): Cleaning Data - Casting Datatypes and Handling Missing Values

31:54

Python Pandas Tutorial (Part 9): Cleaning Data - Casting Datatypes and Handling Missing Values

Corey Schafer 24.02.2020 262 577 просмотров 6 374 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

In this video, we will be learning how to clean our data and cast datatypes. This video is sponsored by Brilliant. Go to https://brilliant.org/cms to sign up for free. Be one of the first 200 people to sign up with this link and get 20% off your premium subscription. In this Python Programming video, we will be learning how to clean our data. We will be learning how to handle remove missing values, fill missing values, cast datatypes, and more. This is an essential skill in Pandas because we will frequently need to modify our data to our needs. Let's get started... The code for this video can be found at: http://bit.ly/Pandas-09 StackOverflow Survey Download Page - http://bit.ly/SO-Survey-Download ✅ Support My Channel Through Patreon: https://www.patreon.com/coreyms ✅ Become a Channel Member: https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g/join ✅ One-Time Contribution Through PayPal: https://goo.gl/649HFY ✅ Cryptocurrency Donations: Bitcoin Wallet - 3MPH8oY2EAgbLVy7RBMinwcBntggi7qeG3 Ethereum Wallet - 0x151649418616068fB46C3598083817101d3bCD33 Litecoin Wallet - MPvEBY5fxGkmPQgocfJbxP6EmTo5UUXMot ✅ Corey's Public Amazon Wishlist http://a.co/inIyro1 ✅ Equipment I Use and Books I Recommend: https://www.amazon.com/shop/coreyschafer ▶️ You Can Find Me On: My Website - http://coreyms.com/ My Second Channel - https://www.youtube.com/c/coreymschafer Facebook - https://www.facebook.com/CoreyMSchafer Twitter - https://twitter.com/CoreyMSchafer Instagram - https://www.instagram.com/coreymschafer/ #Python #Pandas

Методичка по этому видео

Структурированный конспект

Мастерство очистки данных в Pandas: как превращать хаос в структуру

Практическое руководство по обработке пропущенных значений и изменению типов данных для аналитиков данных, рассчитанное на 32 минуты.

Оглавление (10 сегментов)

<Untitled Chapter 1>

hey there how's it going everybody in this video we're going to be learning

Handle Missing Values

how to handle missing values and also how to clean up our data a bit now almost every data set that you're going to be working with is likely going to have some missing data or data that we'd like to clean up or convert to a different data type so we'll learn how to do all of that here now towards the end of the video we'll combine what we learned here to be able to look at our stack Overflow survey data and calculate the average years of experiences of developers who answered the survey so be sure to stay around for that and it's going to be great practice for what we learned here uh now I would like to mention that we do have a sponsor for this series of videos and that is brilliant so I really want to thank brilliant for sponsoring the series and it would be great if you all could check them out using the link in the description section below and support the sponsors and I'll talk more about their services in just a bit so with that said let's go ahead and get started okay so first let's talk about how to

How To Drop Missing Values

drop missing values so I have my Snippets file open here and we've seen this in previous videos and again if anyone wants to follow along then I'll have a link to all of these notebooks and the data in the description section below and as we've seen in previous videos we'll learn how to do some of this in our smaller Snippets data frame first and then we'll see how to do some interesting stuff on our larger stack Overflow data set uh to get this working on some real world data so for this video I've added some null values here into our Snippets data frame uh that we didn't have before so I added some extra first names here and we can see that I just have one that is uh a numpy naan which is a not a number value I also imported numpy up here at the top uh this one here is just a nonone value and then I also have some custom missing values as well this one is just a string of Na and missing so I have some nens some nuns and stuff like that uh thrown throughout this data so that we have some missing values so you're going to see this a lot uh that when we work with pandas uh we're going to have some missing data and depending on what it is you're trying to do you might want to handle this in different ways so one thing you might want to do with missing data is to Simply remove it so for our small data frame here uh let's say that we're going to do some analysis with these people in the data frame but if they don't have their first name last name and email address uh then we can't do what we're trying to do so we'll just remove the rows that don't have those values so in order to do this we can use the drop in method so let's do this and then I'll explain the results uh and go over those so all I'm going to do down here with my data frame is I'm going to say DF do dropna and we're going to run that without any arguments right now so when we run this we can see that now we only get four rows of data here and up here we had uh let's see four five six 7 so we got these four rows here uh because they didn't have any missing values now we do still have our bottom row here uh which has some of our custom missing values but we'll see how to deal with these in just a second but for now let's go over what drop in a is actually doing here now what's going on in the background is that drop in a is using some default arguments so I'm going to manually fill in these default arguments and it might make more sense why we got this specific result so by default I'll leave that one here and now I'm going to fill in drop in a again but I'm going to put the default arguments that already has and the default arguments of what this is doing in the background is it has an axis set to index and it has a how variable set to any so since this is what the method was using by default anyway we should go ahead and get the same results here and we can see that we do we get the same results as we did when we ran this up here but now let me actually explain these arguments here so

Axis Argument

first we have the axis argument so this can either be set to index or set to columns that is going to tell pandas that we want to drop na values when our rows are missing values when it's set here to index if we set this to columns then it would instead drop columns if they had missing values and we'll look at that in just a second now the second argument here is how uh we want to drop these or I guess a better way to frame that is this is the criteria that it uses for dropping a row or a column so by default this is set to any so we're looking over our rows since this is set to index and this is set to any here so it will drop rows with any missing values but this might not be what you want maybe uh with this kind of an analysis that we're doing it's okay to have uh you know missing email or last name or something like that uh but there just has to be something it can't just be an entire row of missing values so if that's the case then we can instead change this how argument to all and this will then only drop rows when all of the values in that row are missing so now if I run this then we can see that now we get back more rows than we did before because it kept the rows that had some missing values uh but not all missing values so we can see here we have an email missing but there were some other uh columns filled in and we can see that everything was missing here but they did have an email so all of the values have to missing in order for uh this to actually drop those so it looks like we are missing index of four if I go up to my original data frame here we can see that index had all missing values there okay now if I instead change this axis to columns instead of index then it will drop columns that have all missing values uh we don't have any columns that have missing values all the way down so it should just return our original data frame so if I say columns here and run this then we can see that that's what we get because none of these columns have missing values all the way down uh now if I set this back to the default and drop columns with any missing values then we'll actually get an empty data frame returned because we have one row that is completely empty uh that we saw here this uh index of four uh so for that row each column is going to have at least one missing value and if we set this to any then any column which is even a single missing value will be dropped which in this case is all of them so if I change this to any then since we have all missing values in one of these rows uh that's just going to give us an empty data frame now at this point you might be wondering uh okay well uh my data is a bit more complicated than this and I'm doing some an analysis where I want to drop some missing values but I only want

Drop Rows That Are Missing Values in a Specific Column

to drop rows that are missing values in a specific column so for example let's say that we're doing some analysis on our data and it's fine if they don't have a first name or a last name but we really need the email address and if they don't have an email address then we need to just drop those rows so in order to do this we can pass in a subset argument so first I'm going to set our uh axis here back to index so that we're dropping rows and now we want to pass in a subset argument and this subset will be the column names that we're checking for missing values so in this case it's just going to be a single column so I'm going to say subset is equal to and I'm still going to pass in a list even though this is just a single column and I'll will say email so if I run this then we can see that the data frame that we get back is full of rows that have uh at least their email address filled in and again this one down here with these na values that is our custom missing values and I'll show you how to treat those as missing values in just a bit now in this case here since we're only passing in a single column for our subset our how argument here uh isn't really doing much because it's only going to look at the email address for missing values so if an email address isn't filled in uh then passing in either any or all for our argument here uh would trigger that row to be removed so even if I put this as all it should give us the same results because we're only checking one value but we can also pass in multiple columns to our subset so what if we said okay well in order for my data to be useful I need either their last name or their email address but I don't need both so in order to do this we could just say okay they need um all of the values in last name and an email or I'm sorry there that uh I got that reversed uh they don't need their last name and the email uh it just can't be that all of those values are missing so as long as the last name or the email is there then it shouldn't drop those rows so if I run this then we can see that we get some values that don't have an email but they did have a last name and also uh we would get back some values that didn't have a last name but do have an email just like this Anonymous one here it has an email but it doesn't have a last name and again that's because we passed in all for our how argument which means for a row to be dropped both of the subset columns needed to be missing now like we've seen in previous videos this isn't permanently changing our data frame values if we want to permanently change our data frame then we'd have to add the in place argument and set that equal to True uh here within this method but we've seen that a bunch throughout the series so far uh so I don't think I'll go over that again here okay so now

Custom Missing Values

let's get to these custom missing values we can see down here uh that we have a row here that has some customized missing values so for example maybe the people who you got our data from uh didn't know what to do with missing values so instead they just passed in a string of Na or they passed in you know a string of missing like we have here so how would we actually handle these well it depends on how we load in our data in this case we've created our data frame from scratch by uh creating a dictionary and then creating our data frame here so what we can do here is just simply replace those values with an Nan value now if we instead loaded in our data from a CSV file then we could do something different but first I'll show this and then we'll take a look at the CSV file later whenever I go over to the stack Overflow data so right here at the top where we created our data frame I'm going to replace these values with a proper numpy Nan value uh so to do this I'm just going to go a couple of lines down here and we've seen this in previous videos but we can use this replace here and I'm replacing all the values in the entire data frame so anytime we see a string of Na I'm going to replace that with numpy Na and again I am importing numpy up here as MP so that's where I'm getting uh I'm able to use numpy and then I want to say in place equal to true because we actually want to modify that data frame so if I run that then that would replace those values but I'm also going to place uh replace this string of missing as well with mp. Nan values and I want to do that in place as well so let's go ahead and run this that should replace those values and now if I look at our data frame here then we can see that we no longer have that string of missing or na these are now all Nan values and now if we go back through and we run our cells where we drw in a values uh then these custom values should have been replaced and it should treat those as missing values so right here we can see what our previous result was where we got this index of six with those custom values if I rerun this now we can see that that's gone and the same with here if I rerun this then that is gone as well now if you don't actually want to make any changes and we just want to see if certain values would or wouldn't be treated as na values then we could just run the na or is in a method uh and get a mask of values uh as to whether or not these classify as na or not so let me just show you what I mean here so I could say DF do isna and this is just going to give us a mask here of values that uh or whether or not they are classified as an NA value so we can see that our row four here uh was all na values and so same thing with our row six and we can see some other missing values uh through out here as well now sometimes uh especially when we're working with numerical data we might want to F our na values with a particular value now I'm working with string data here but sometimes it make might make sense to fill your na values with certain values uh with these as well so for example let's assume that we were calculating grades for assignments or something like that and you had some assignments that were NAA because the student never turned in the assignment well at that point you could just decide if you wanted to score all missing assignments as zeros so that you could properly calculate up the grades so to do something like this uh we can use the fill in a method so for example I could say something like this if I do a DF do fill Na and then pass in a value uh just to show you exactly what this is doing I'm going to fill all of our missing values with this capitalized missing string here and if I run this then we can see that all of those missing or all of those na values were filled with this string capitalized as missing now like I said before I don't do this a lot with certain strings uh I found this uh to be most useful for numerical data uh depending on how you're doing your calculations uh but you might want to give na values a value of zero or Nega one or something like that so if it would make sense with your data and you had numerical data to replace your missing values with a zero then you could just run DF fill na0 and if I go ahead and run this then we can see that works on our data frame as well and again just like with our other methods if you want those changes to your data frame to be permanent and carry over into other cells then simply just add that inplace argument and set that to true to make that change permanent okay so now let's look at another common thing that we're likely going to need to do with a lot of our data and that is

Casting Data Types

casting data types so I have another column in my Snippets here that I didn't have in previous videos and I have up here if we look this is this age column so let's say that we wanted to get the average age of all the people in this sample data frame well right now these might look like numbers when we print them out in our data frame down here but the these are actually strings and we can see this if we look at our data Frame data types so to do this we can say DF do D types and that's not a method it's just an attribute so if I run this here then we can see that it says all of these columns are objects and when it says it's an object it likely means it's a string or a mix of different things so in the latest version of python or pandis I'm sorry they actually updated it so that there's actually a string data type now uh but I'll do a video on those pandis version updates at the end of this series since they actually released that updated version uh as I was writing this course uh but don't worry there's not a lot that's changed to where uh what you learn here will be outdated or anything like that uh it's still mostly the same uh but we can see here that our age column is a string because it's uh this object data type so if we wanted the average age uh then it wouldn't work as it is now so let's just see what this error looks like so I'm going to grab the mean of that age column and if I run this then we can you can see that right now we get an error and if I scroll down to see what this error was it says can only concatenate uh Str not int to string now that might not be the most easy to understand error right there uh but basically it's telling us that because that column is strings and not integers so we need to convert that column to numbers instead of a string now there's a caveat uh when doing this and this might throw some people off so when we have Nan values in a column that we're trying to convert to numbers then you need to use the float data type and that's because the N value is actually a float under the hood uh let me go ahead and show this just to show you what this looks like so I'm going to look up the type of np. nan and we can see that is a float so if we try to convert this column to integers then it's going to throw an error when it runs into those n values because it can't convert those so if I was to say DF and of age is equal to and now let's try to convert these to integers so the way that you cast data types here is we can just say Okay I want the age column as type and now we want to pass in the type that we want if I try to convert these to integers then this is going to give us an error because we have some Nan values uh so we can see here int argument must be a string uh not nun type so when you're trying to convert these to numbers and you have those Nan values you basically have two options here if your column didn't have any missing values then this would just work fine we wouldn't even run into this error uh but if it does have missing values then you can either convert those missing values to something else like a zero using the F na method that we saw before or you can just cast that column to a float instead now I think this would be a bad idea to convert those missing values to a zero or some other number uh because we're trying to compute the average in this case uh but depending on your data that might be what you want to do but I'm going to go ahead and just convert these to floats so those Nan values stay missing values so instead of an INT here I'm just going to convert this to a float and if I run this then that seemed to have worked so now we can look at the data types again so I'll say DF whoops sorry I wasn't typing in that cell I can say DF DOD types and if we look at this then we can see that now our age is a float object here so now let's see what happens when we try to take the average of that column so I'll say DF do mean and if I run that then we can see that we get the average value for those ages now if you have an entire data frame of numbers or something like that you're trying to convert all at once then the data frame object has an asz type method as well so you could just say DF do astype and then pass in whatever data type you're trying to cast everything to uh and just convert everything in the data frame at once but we have some mixed columns here so we don't want to do that okay so we've been looking at our small data set right now to test this stuff out but now let's take what we learned here and learn how this applies to real world data and do some analysis on our stack Overflow survey data okay so first of all uh I mentioned earlier that if we had custom values for missing data then it's a little bit easier to handle these when loading in a CSV and what I'm talking about up here is up here at the top where we replaced these custom missing values let me show you how we would do this same thing uh but loading in a CSV

Loading in a Csv

instead so I'm going to switch over here to my stack Overflow survey data let me go ahead and rerun this just to make sure that um all of this stuff is running okay so this notebook's still running that's good and again this is that stack Overflow data that we will have been using throughout the series and if you'd like to follow along then I do have a download link for this in the description section below okay so if I wanted to ignore those custom values when loading in a CSV then we can simply pass in an argument of a list of values that we want to be treated as missing so here's how we would do this if we had some custom missing value here in this CSV file then I could simply create a list here of those missing values and I will just call this Na vals and now I'll pass in a list of those so let's say that we have some values that are a string of Na and a string of missing uh so now what we could do here is just add in an argument and say na values is equal to and then that list that we just created and if we run that then we shouldn't get any errors and when it reads in that CSV then it will treat that list of values uh as missing values and give them an N result now in this survey here they did a good job of not having any weird occurrences like that so that actually shouldn't change anything okay so now let's look at an

Problem with Casting some Values

interesting problem with casting some values so let's say that for the developers who answered this survey we wanted to calculate the average number of years of coding experience among all of them now that might be a good thing to know to compare your experience against the average uh but let's look at what this uh or why this might be difficult to calculate with this data set and US calculating this solution is actually going to apply several Concepts that we've learned so far throughout this series so the column to view the answer for uh this question in the survey is called years code so let's look at some of these answers so I'm just going to look at the top 10 answers for Year's code so I will do a do head and let's look at the top 10 so if I run this then at first glance this doesn't really look like it'll be a problem we just have a bunch of integers and the number of years that different respondents have been coding so you might think that we can just grab the mean uh by of this column simply by saying okay if we just have a bunch of integers here and some Nan values that's fine let's just grab the mean of that column but if I run this then we get an error and if I scroll down here then it says can only concatenate string uh to string and we saw the same error in our smaller data set where the column was actually being loaded in as a string instead of numerical data and we should know how to handle this by now since we did it in the smaller data set so let's try that so let's try to convert to floats and then take the average so let me go back up here to the top where we ran this and instead of running the mean here I'm going to say okay well let's convert this to a float so that we can grab that average so I will say as type and we want to convert this to a float since there are Nan values so if I run this then we still get an error so we didn't get an error in our smaller data set here so if I scroll down then it says could not convert string to float and the string that it couldn't convert was less than one year so this might be something that we didn't expect here so obviously we don't just have numbers and Nan values in this column there is actually a string value that respondents could select that is equal to this string of less than one year for coding experience so let's look at all the unique values of this column so that we can see exactly what's in here in case there are more strings like this and I don't believe we've actually seen this in the series yet maybe we have I'm can't really remember uh but if we want

View Unique Values of a Series

to view unique values of a series then we can simply use the unique method so we could also use the value counts method that we've seen several times before if we want to count all the unique values uh but we don't really want to count them we just want to see all the unique values in this column so to do this I'm going to say d F and then access that years code column do unique do unique that is a method so if I run this whoops and I spelled this wrong sorry having a hard time typing today okay so if I run this then this gives us all the unique values of that column and as we'd expect there are a lot of numbers but we see that uh we also have some strings that are mixed throughout these numbers now we also have Nan values here but we're not going to worry about those we just want to ignore the Nan values because that's just people who didn't answer the question but we can see that the strings that we have throughout here are less than one year and more than 50 years of coding experience okay so those are our only string values so I'm going to replace those with numbers so that we can get an idea of the average years people have been coding so let's go ahead and replace less than one year here with a zero since that's basically the same thing if somebody's been coding for less than a year then they've been coding for basically zero years so to do this I can say DF do years code and access that column and then we can just replace that value of less than one year and let's replace that with a zero and we also want this to be in place equal to true because we actually want to modify that data frame so if I run that then it should make that replacement and now I'm also going to replace the value for more than 50 years here and this is going to skew our results a bit depending on how we want to do this I'm simply going to replace this with the value of 51 uh there may be some people who have several more uh more years of coding experience than 51 years but I can't imagine that it would be that many people who have you know coded many years greater than 50 so I'm going just going to fill this in with 51 but like I said depending on what we pick here it could affect our results slightly but not by a lot in this case so I'm just going to grab this same replace value here and instead I want to replace more than 50 years and I'm going to replace that with a value of 51 so now let me go ahead and run this and if we want to look at these unique values again then we could look at these and now it doesn't look like we have any strings in here but we can see here that this is still a dtype of object which means that it's not actually reading this in as floats so if we scroll back up here a bit oh actually I think I um overwrote that line yes I did so let's just try that again so what I want to do here is I want to convert this to a float and this is what uh gave us an error before because we had these strings in here and it didn't know how to convert these to a float but now we should just be able to see say Okay I want to convert that as type set that to a float so let's run that and we didn't get an error this time so that's good and now we should be able to view the average numbers of or average number of years of coding experience of the developers who filled out this survey so to do this I'm just going to say okay DF data frame access that column grab the mean of that column so if I run this then we can see that now we get that average back so the average that we got here was about 11 and a half years of coding experience uh as the average years for developers who answered this survey and now you can do other analysis on this as well so for example if we wanted to see the median then I could run that and the median comes back as 9 years of coding experience so hopefully that real world example helped explain why it's important to know how to cast these values and understand what's going on there uh there's always going to be data that is messy or not in the format that we want it in so knowing how to handle these missing values and cast these values to different data types is really going to be essential uh when working with pandis okay so before we end here I'd like to thank the sponsor of this video and mention why I really enjoy their tutorials and that is brilliant so in this series we've been learning about pandas and how to analyze data in Python and brilant would be an excellent way to supplement what you learn here with their Hands-On courses they have some excellent courses and lessons that do a deep dive on how to think about and analyze data correctly for data analysis fundamentals I would really recommend checking out their statistics course which shows you how to analyze graphs and determine significance in the data and I would also recommend their machine learning course which takes data analysis to a new level where you learn about the techniques being used that allow machines to make decisions where there's just too many variables for a human to consider so to support my channel and learn more about brilliant you can go to brilliant. org for/ CMS to sign up for free and also the first 200 people that go to that link will get 20% off the annual premium subscription and you can find that link in the description section below again that's brilliant. org CMS okay so I think that's going to do it for this pandis video I hope you feel like you got a good idea for how to handle these missing values and cast our data to different data types so that we can do exactly what we want to do in terms of analyzing our data now in the next video we're going to be learning how to work with dates and time series data now I've been using the stack Overflow survey data for this entire series because I love being able to show you all real world examples of how these Concepts apply but our stack Overflow survey data doesn't have any time series data that we can actually work with so I'm going to be using a different data set for the next video and I still haven't narrowed down exactly what I'll be using but I'll be sure that it allows us to do some analysis on some real world data like we've been doing so maybe we'll use time series data to you know analyze cryptocurrency rates over time or something like that but if anyone has any questions about what we covered in this video then feel free to ask in the comment section below and I'll do my best to answer those and if you enjoy these tutorials and would like to support them then there are several ways you can do that the easiest way is to Simply like the video and give it a thumbs up and also it's a huge help to share these videos with anyone who you think would find them useful and if you have the means you can contribute through patreon and there's a link to that page in the description section below be sure to subscribe for future videos and thank you all for watching

Другие видео автора — Corey Schafer

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник