Python for AI #2: Exploring and Cleaning Data with Pandas

38:39


AssemblyAI · 09.03.2023 · 33,346 views · 781 likes


Video description

Python for AI Development Course #2: In this lesson, we learn how to explore, clean and prepare data for machine learning model training.

Get the code here: https://github.com/AssemblyAI/youtube-tutorials/blob/main/Data%20preparation%20and%20model%20training.ipynb
Get the data: https://www.kaggle.com/datasets/usdot/flight-delays

00:00 - Intro
01:22 - Types of data
02:00 - Data documentation
03:34 - Setting up the notebook
06:23 - First look at the data
09:41 - Missing values
20:24 - Outliers
24:46 - Issues with categorical values
30:26 - Preparing the target feature
35:06 - Final prep for model training

Python for AI Course: https://youtube.com/playlist?list=PLcWfeUsAys2lpJzESyeRUVvJlU6ycjr-b
Get your free token for AssemblyAI: https://www.assemblyai.com/?utm_source=youtube&utm_medium=referral&utm_campaign=yt_mis_35

CONNECT
Website: https://www.assemblyai.com
Twitter: https://twitter.com/AssemblyAI
Discord: https://discord.gg/Cd8MyVJAXd
Subscribe: https://www.youtube.com/c/AssemblyAI?sub_confirmation=1
We're hiring! Check our open roles: https://www.assemblyai.com/careers

#MachineLearning #DeepLearning

Contents (10 segments)

Intro

Hey, and welcome to the second chapter of the Python for AI Development course by AssemblyAI. I am Mısra Turp. Today I'm going to show you how to prepare your data for training a machine learning algorithm, and in the next lesson we're going to learn how to use the scikit-learn library to train one. I'm going to build this code in a Jupyter notebook. You already learned how to set up a Jupyter notebook on your laptop and also on Google Colab, and you can use either one for this lesson; it's totally up to you. If you want to follow along with the course, I will leave a link to the code somewhere up here and also in the description, so you can follow along as I code.

Of course, we're going to be using Python, and mainly the pandas library, to do our data analysis. We're going to explore the data to understand it a little better, deal with any problems we run into during exploration, and prepare the data for model training at the end. I'm going to take a Kaggle data set, the flights data set that I will introduce in a second, and go through it as an example, so it's not only going to be theoretical: you can see how it's done in real life.

Types of data

But first I want to talk to you about the types of data. If you're new to data science and AI, it might be a little overwhelming to see what kinds of data sets are out there: your data set could be tabular data, images, text files or even audio files. Each of these types has a different way of being explored and cleaned. The beginner level is to work with tabular data, because it's more structured and easier to understand, and it's a good way to get some experience before you jump into other types of data sets.

Data documentation

There are a couple of steps that you should take before you do any coding, and the first one is to understand where your data is coming from. Sometimes you download your data from the internet, maybe from Kaggle or a university's research group, or if you're working at a company, maybe you get it from another team. One of the main things you have to look for is documentation, because data sets are not always self-explanatory. There could be a column name that is abbreviated or just looks odd and that you don't understand, and as a data scientist you're not always the subject matter expert in the field you're working in. That's why I always look for documentation.

If you're getting your data from Kaggle, there will be a page where you can see details for your data set. If you're getting it from some other place on the internet, there's probably going to be a .txt or PDF file explaining how the data was collected, what each of the columns means and what their units are. Say it's a length column: is it in inches or in meters? If it's a time column, is it in minutes or hours? These are really important things to know. If you're working with internal data in a company, it is important to talk to the people who prepared the data and understand how it was collected. This will give you an idea of what kinds of problems to look for during your exploration, so you'll be a bit more efficient. So let's get started.

Setting up the notebook

The first thing that I want to do, of course, is to import pandas, and I'm also going to import NumPy, because you never know when you're going to need NumPy, which is more or less always. I will also show you one of my favorite settings: setting the pandas option display.max_columns to None, so that there is no limit on how many columns are shown. I'll show you why that matters in a second.

Let me get my data set here, flightsample.csv. As I mentioned, I'm using the flights data set from Kaggle. It has information on delayed flights but also non-delayed flights; we're going to look at it in more detail, and I'll tell you what kind of problem I'm going to solve with it. Just so you know, if you download it from Kaggle you will have a flights.csv, but I sampled fewer data points from that big data set because it was a bit too big to run as an example. That's why mine is called flight sampled. I will read it into a flights variable.

Then let's see how many rows and columns this data set has; you use the shape attribute for that. It apparently has more than a million rows and 31 columns. If I print this data set, it shows me all of the columns, but as you can see it doesn't show me all of the rows, because there is a max rows limit. If I also set a max columns limit and run this again, I only get five columns. So sometimes, if your data set is really big and you print it to understand all the columns in there, it will not show you all of them; there will be a limit. If you want to eliminate that limit, all you have to do is set display.max_columns to None, and then you'll be able to see all of your columns. If you want to see only a couple of data points, you can call flights.head(), which shows you only the first five data points, or you can specify how many you want to see, say 10, and it will show you the first 10.

As I said, we're going to be doing data exploration today, but what exactly does that mean? We're going to look for missing values and for outliers. We're also going to make sure that our categorical and binary values are consistent and there are no problems with them. Lastly, we're going to look at column types to make sure all of the columns have the correct type: a number if it's a number, or object, as the naming goes in pandas, if it's a category.
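A minimal sketch of the setup steps described in this segment. The tiny in-memory CSV and its column names are hypothetical stand-ins for the sampled Kaggle file, which is not reproduced here; only `pd.set_option`, `pd.read_csv`, `.shape` and `.head()` are taken from what the lesson describes.

```python
import io

import numpy as np  # imported as in the lesson; not used in this snippet
import pandas as pd

# Remove the cap on how many columns pandas prints.
pd.set_option("display.max_columns", None)

# Hypothetical stand-in for the sampled flights CSV.
csv_text = """YEAR,MONTH,AIRLINE,ARRIVAL_DELAY
2015,1,DL,-5
2015,1,AA,25
2015,2,UA,
"""
flights = pd.read_csv(io.StringIO(csv_text))

print(flights.shape)    # (rows, columns) tuple; an attribute, not a function call
print(flights.head())   # first five rows by default
print(flights.head(2))  # or ask for a specific number of rows
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.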

First look at the data

Let's quickly take a look at our data set. For each row we have one flight: the date and time of the flight, which airline is running it, flight number, tail number, origin and destination, when it was scheduled to depart, when it actually departed, how long the flight took, when the scheduled arrival was, when it actually arrived, and the delay. We also have the information of whether it was canceled, why it was canceled if so, and some reasons for delay.

The problem we're going to try to solve today is predicting whether there is going to be a delay and, if there is, what it will be caused by. As you can see, we have one, two, three, four, five different delay reasons here. We're going to change the data set so that, when we train the model, it can tell us what the delay is going to be caused by. So it's kind of like predicting the future a little bit.

As you can see, there are a lot of columns here that I will not really need during exploration, so I'm going to remove a bunch of them. I can use flights.columns to see all of the columns, and then I will select only some of them. If you use double brackets in pandas, that means you're choosing a subset of your columns. So I will copy and paste all of the column names here and keep only the ones that I'm interested in.

I want to keep year, month, day of week and airline, because you can imagine that which airline is running the flight will affect whether it is going to be delayed or not; maybe some airlines tend to be late more than others. I don't really need the flight number, because it is basically captured by origin airport, destination airport, scheduled departure and airline. The tail number I also don't feel like I need. Of course, these are decisions you can undo: if you realize that you need some of this information, you can come back and change this.

Scheduled departure I keep. I don't really need the actual departure time. When you think about it, this model will be used before the flight, so we will not have certain information like departure time, arrival time or arrival delay. That's why I mainly want to keep information that I will already have at that point. One thing to be said about this, though: if I have some missing information, maybe I will be able to use departure time, for example, to fill it in, so I will keep departure time and departure delay for now. Taxi out, wheels off, scheduled time and elapsed time will not be really relevant, so I will remove those. Scheduled arrival and arrival time will definitely be helpful, and arrival delay is exactly what I need. I'll also keep canceled and cancellation reason, and all the delay reasons for now. So let's take a look at our data set again. All right, these are the only ones that I want so far.
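The double-bracket column selection described above can be sketched as follows. The frame, its column names and the tail-number values are hypothetical; the real data set has 31 columns.

```python
import pandas as pd

# Hypothetical miniature of the flights table.
flights = pd.DataFrame({
    "YEAR": [2015, 2015],
    "AIRLINE": ["DL", "AA"],
    "TAIL_NUMBER": ["N123DL", "N456AA"],  # made-up values
    "ARRIVAL_DELAY": [-5.0, 25.0],
})

# List every column name so the ones to keep can be copied from it.
print(flights.columns.tolist())

# Passing a list inside the indexing brackets (hence "double brackets")
# returns a DataFrame containing only those columns.
columns_to_keep = ["YEAR", "AIRLINE", "ARRIVAL_DELAY"]
flights = flights[columns_to_keep]
print(flights)
```

Because the selection is just a list of names, undoing the decision later only means adding a name back to the list and rerunning the cell.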

Missing values

I will immediately take a look at whether there are any missing values. Okay, it looks like there are a bunch. We are missing departure time and departure delay sometimes, arrival time sometimes, and arrival delay. We are also missing cancellation reason, which makes sense, because many of the flights were probably not canceled, so there is no cancellation reason. The interesting thing is that we do not have delay reason information most of the time, and each of the delay reasons is missing the same number of times. So maybe it's not really a missing value; maybe the information is empty simply because the flight was not delayed.

To understand whether that is the case, I want to see what happens every time arrival delay is missing. When arrival delay is missing, we are also missing the delay reasons, which makes sense. But that is only 7,806 times, and the delay reasons are missing far more often. So let me check whether the delay reasons are all missing at the same time; it could just be a coincidence that these counts are the same. To save you some time, I prepared this code beforehand. Basically I'm saying: show me all of the rows where air system delay is null, and security delay is null, and airline delay is null, and late aircraft delay is null, and weather delay is null. Null, None, not a number: call these values what you like. That is 35,000, no, 354,479 times, which is exactly how many times each of these columns has missing information. This means that every time air system delay is null, all the other delay reasons are also null, and they're only null together. That is good to know. One thing that is interesting, though: I would have expected arrival delay to have the same number of missing values. So let's put this on a histogram, and then it will be clearer what exactly is missing.
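A small sketch of the two checks described in this passage: counting missing values per column, and testing whether the delay-reason columns are always missing together. The data and column names are hypothetical, and the chained `&` mask is one plausible way to write the prepared code the transcript mentions, not necessarily the author's exact cell.

```python
import numpy as np
import pandas as pd

delay_cols = [
    "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY",
    "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY",
]

# Hypothetical rows: only the second flight has delay reasons recorded.
flights = pd.DataFrame({
    "ARRIVAL_DELAY":       [-3.0, 40.0, np.nan, 12.0],
    "AIR_SYSTEM_DELAY":    [np.nan, 40.0, np.nan, np.nan],
    "SECURITY_DELAY":      [np.nan, 0.0, np.nan, np.nan],
    "AIRLINE_DELAY":       [np.nan, 0.0, np.nan, np.nan],
    "LATE_AIRCRAFT_DELAY": [np.nan, 0.0, np.nan, np.nan],
    "WEATHER_DELAY":       [np.nan, 0.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = flights.isnull().sum()
print(missing_per_column)

# Rows where every delay reason is missing at the same time.
all_reasons_missing = (
    flights["AIR_SYSTEM_DELAY"].isnull()
    & flights["SECURITY_DELAY"].isnull()
    & flights["AIRLINE_DELAY"].isnull()
    & flights["LATE_AIRCRAFT_DELAY"].isnull()
    & flights["WEATHER_DELAY"].isnull()
)
n_all_missing = int(all_reasons_missing.sum())

# If this count equals each column's own missing count, the columns
# are only ever missing together.
missing_together = bool((missing_per_column[delay_cols] == n_all_missing).all())
print(n_all_missing, missing_together)
```

A shorter equivalent of the chained mask is `flights[delay_cols].isnull().all(axis=1)`.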
All right, so this is a histogram of arrival delay when air system delay, security delay, airline delay and late aircraft delay are missing. We see that arrival delay starts from minus 60 or minus 80 and only goes up to about 10 or 15, which is interesting. Let's take a closer look. I want to see every time arrival delay is bigger than 10. Okay, arrival delay is 14, 15, but these all look like very small numbers. What if I say arrival delay is longer than 15? Oh, interesting: we do not have any data points where the delay reasons are missing and arrival delay is longer than 15 minutes. What about 14 minutes? Still nothing. 13 minutes? Okay, there's something.

This basically tells me that when the arrival delay is shorter than 15 minutes, the plane is not counted as late, and that's why no delay reason is given. If I worked at the company, or in this field, I would probably already know this, but it is something you can learn while working with the data. So from now on I will assume that only flights that arrived more than 15 minutes after their scheduled arrival time are counted as late.

Okay, so this is interesting information, but I also want to see why the arrival delay is sometimes missing. Let's take a look at this again. We see here that arrival delay is missing; for some of these rows we are missing the arrival time, and for others, even though there is scheduled arrival and arrival time information, we are still missing arrival delay. I will set aside some of this data to be able to look at it more clearly: give me all of the data points where arrival delay is null. So what could be the reason? Oh yes, it looks like most of these were canceled. That might explain why we have None as the arrival time here, but it does not explain these other cases. So let's see how many of them were actually canceled and how many weren't. Okay, a good amount was canceled.

Maybe a better way of showing this is to use value_counts. It tells me, for the data points where arrival delay is missing, how many times the flight was canceled and how many times it wasn't. Okay, it looks like most of the time the flight was canceled; in only a bit more than a thousand of the cases it wasn't, and we see two examples here.

There are a bunch of ways you can deal with this. If you remember, what we're trying to do is predict why a flight was delayed. Here we clearly see that this flight was supposed to land at 2026. If you are watching from the US you might not be familiar with this notation; it means 8:26 PM. But it landed at 10:16 PM, so there is one hour and 50 minutes of delay, yet we have neither the delay value nor why it was delayed. What we could do here is calculate the arrival delay. If we were building a model to predict how long a delay is going to be, we could definitely have used these two columns to calculate it. But we're trying to predict the delay reason, and we cannot generate the delay reason from what we have. Since we have more than a million data points and this problem only covers about a thousand of them, I'm just going to delete them. That makes the data easier to work with, there is not going to be any dirty data in there, and I cannot use these rows at the end of the day anyway.

So what I'm going to do is say: flights, remove the rows where arrival delay is null. Basically, you create a filter that is true every time the arrival delay in the flights data set is null, and by using the tilde you negate it, so you get all the data points where the arrival delay is not missing. So let's do this.
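The investigation in this passage can be sketched on a toy frame: the largest arrival delay among rows without a delay reason, `value_counts` on the canceled flag for rows with a missing arrival delay, and the tilde-negated filter that drops those rows. Column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: three completed flights, two canceled ones, and one
# flight that landed but has no recorded arrival delay.
flights = pd.DataFrame({
    "ARRIVAL_DELAY":    [-8.0, 14.0, 32.0, np.nan, np.nan, np.nan],
    "CANCELLED":        [0, 0, 0, 1, 1, 0],
    "AIR_SYSTEM_DELAY": [np.nan, np.nan, 32.0, np.nan, np.nan, np.nan],
})

# Largest arrival delay among flights that have no delay reason.
no_reason = flights[flights["AIR_SYSTEM_DELAY"].isnull()]
max_delay_without_reason = no_reason["ARRIVAL_DELAY"].max()
print(max_delay_without_reason)

# How many of the rows with a missing arrival delay were canceled?
missing_arrival = flights[flights["ARRIVAL_DELAY"].isnull()]
cancel_counts = missing_arrival["CANCELLED"].value_counts()
print(cancel_counts)

# Keep only rows where the arrival delay is NOT missing (~ negates the mask).
flights_clean = flights[~flights["ARRIVAL_DELAY"].isnull()]
print(flights_clean.shape)
```

`flights["ARRIVAL_DELAY"].notnull()` would give the same mask without the tilde; the negated form is shown because it mirrors the description above.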
And then I also want to see what the shape of my data set is. Right now it has 1,410,212 rows, and before it had about 1,417,000, so there is not a big difference. It's a good thing that we deleted those rows: some we deleted because we cannot use them, and the rest were canceled flights, which we cannot use either.

All right, let's see what kind of missing information is left in my data set after this. We do not have any missing arrival delay left. Cancellation reason is still missing, which, as I said, is expected, and the delay reasons are still missing. What I can do is just fill them in as zero, because we saw that the only time these values are missing is when there is no delay. Sometimes there is an arrival delay of less than 15 minutes, but those flights are not considered late, so there is no reason for delay. That's why I'm going to fill this information in as zero.

How I'm going to do that is by selecting the columns from flights (I can just copy and paste the names here, which is a bit easier), using the fillna method and specifying what I want them to be filled with. This is what I get back, but instead of just displaying it, I will assign what the method returns to my columns. When I check my flights data set again, you will see that what used to be NaN is zero now.

All right, it looks like I'm more or less done with missing information. As I said, I do not need to deal with cancellation reason, because it will not make its way to the model training phase; I will remove it before then.
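A minimal sketch of the fill step, assuming the five delay-reason columns carry the names used in the Kaggle file (an assumption; the transcript only speaks them aloud). `fillna` returns a new object, so the result is assigned back to the same columns.

```python
import numpy as np
import pandas as pd

delay_cols = [
    "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY",
    "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY",
]

# Hypothetical rows: one late flight with reasons, two on-time flights without.
flights = pd.DataFrame({
    "ARRIVAL_DELAY":       [25.0, -4.0, 9.0],
    "CANCELLATION_REASON": [np.nan, np.nan, np.nan],
    "AIR_SYSTEM_DELAY":    [25.0, np.nan, np.nan],
    "SECURITY_DELAY":      [0.0, np.nan, np.nan],
    "AIRLINE_DELAY":       [0.0, np.nan, np.nan],
    "LATE_AIRCRAFT_DELAY": [0.0, np.nan, np.nan],
    "WEATHER_DELAY":       [0.0, np.nan, np.nan],
})

# Fill only the delay-reason columns and write the result back.
flights[delay_cols] = flights[delay_cols].fillna(0)

# The delay reasons are complete now; CANCELLATION_REASON is left untouched.
print(flights.isnull().sum())
```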

Outliers

So then I want to see if there are any outliers in my data, and histograms are a really good way to do that. I will show you how to read histograms in a second. For now, I'm setting the bins to 60 and the figure size to 20 by 20, so we have a bigger canvas and can see all of the little histograms a bit better. What bins means is basically this: the more you increase this number, the more granular the information you're going to see. If I set the bins to 20, the granularity is lower. One number that I like to use is 60, but depending on the data set you're working on, you might need to change that.

All right, let's see if there is anything interesting going on that's not expected. I only have flight information from 2015; this is expected. Month: this is the number of flights I have each month, so for the first month I have more than 120,000 flights, and the other months are similar, more or less as expected. Day of the month: we see a little pattern here. Day of the week: the numbers are from one to seven, which is expected, and we see that maybe on Mondays and Thursdays there are more flights. Scheduled departure time and departure delay look normal. Scheduled arrival looks normal, arrival time looks normal. Arrival delay: sometimes there are negative delays, which means the flight arrived early, and sometimes the delay looks long.

There is one thing you need to pay attention to when you're reading histograms, so let's reiterate what one shows. For each value of a column, it tells you how many times that value occurred. Say this highest bar is at zero: that means arrival delay has been around zero more than 600,000 times in this data set, and the higher the value, the fewer occurrences you see. But one thing you also need to understand is that histograms only show you values on the x-axis that exist in your data set. So if the axis goes up to 2,000, there is actually a value that is 2,000, or maybe a bit less. That's why I want to check whether that is an outlier or not: does it actually make sense that there is an arrival delay of around 2,000 minutes?

Let's first look at delays of more than 500. I have a bit more than 2,000 data points where the arrival delay is higher than 500, and it looks like this actually makes sense. Here the arrival was supposed to be at 12:28 and the flight actually arrived at 8:54, so that's why there is a big delay. Or the flight was supposed to land, maybe the day before, at 10:56, but it arrived at 7:44 in the morning; that's why there is a big delay. And for all of them we do actually have a delay reason. So this looks legit to me.

Let's also look at very big delays. Okay, we only have a handful of these, and it looks like all of this information checks out. This flight was supposed to land at 10:49, and it looks like it arrived four hours later, but maybe it was actually the day after, at 2 PM, so that's why there is a big delay. We also have this delay attributed to a reason. So it looks like there is not actually a problem; some flights are just super late. Okay, that's good to know.
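To make the point about bins and axis range concrete without a plotting backend, this sketch uses `numpy.histogram` on hypothetical arrival delays instead of the plotting call the lesson runs (pandas offers `DataFrame.hist(bins=..., figsize=...)` for that). It then filters the suspiciously large values for inspection, as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical arrival delays in minutes, including two very late flights.
flights = pd.DataFrame(
    {"ARRIVAL_DELAY": [-12, -3, 0, 0, 4, 9, 35, 120, 640, 1980]}
)

# Fewer bins give a coarser picture, more bins a finer one.
coarse_counts, coarse_edges = np.histogram(flights["ARRIVAL_DELAY"], bins=4)
fine_counts, fine_edges = np.histogram(flights["ARRIVAL_DELAY"], bins=20)

# The bin edges span exactly the observed range, so an axis reaching
# 1980 means a value that large really exists in the data.
print(coarse_edges[0], coarse_edges[-1])
print(coarse_counts)

# Pull out the extreme rows to judge whether they are plausible or errors.
very_late = flights[flights["ARRIVAL_DELAY"] > 500]
print(very_late)
```

Whether such rows are outliers to remove is a judgment call; in the lesson they are kept because the scheduled and actual times, plus the recorded delay reasons, are consistent with a real long delay.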

Issues with categorical values

What else can I look at? Do I have all of my delay reasons? I have air system delay here (some of them are really long), security delay, airline delay and late aircraft delay, but as you can see, I do not have a plot for weather delay, and that could point to something. So let's look at what my column types are; I can use dtypes for this. All of these delays should be numeric, but I see that weather delay is object, and object in pandas basically means string. That means there is a problem with weather delay. It could be that there was a little problem when reading the CSV file and pandas was not able to read it properly, or that there is a value in there that cannot be cast to a number, so pandas read the column as strings.

What I'm going to do is try to turn this weather delay column into a numeric column and see what happens. pd.to_numeric is what I'm going to use, and I'm going to pass the column. All right, we get an error, and it says: unable to parse string "-" at position 107. So let's go to position 107 and see what's going on. I'm going to use iloc, so the location where the position is 107.
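A small reproduction of the diagnosis described above, on hypothetical data where the stray dash sits at position 2 rather than 107. The exact wording of the error message may differ between pandas versions, so the sketch only captures it instead of relying on it.

```python
import pandas as pd

# Hypothetical column that was read as text because of one stray "-".
flights = pd.DataFrame({"WEATHER_DELAY": ["0", "12", "-", "3"]})

# A text dtype (object, or a string dtype in newer pandas) on a column
# that should be numeric is the warning sign.
print(flights.dtypes)

# Converting fails on the first value that cannot be parsed as a number.
try:
    pd.to_numeric(flights["WEATHER_DELAY"])
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# iloc selects by integer position, so this shows the offending row.
bad_row = flights.iloc[2]
print(bad_row)
```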
And this is the row, this is the flight, and I see that weather delay is indeed a dash. When this happens, you have a couple of options. You can go into your data set and literally change this value manually, you can change it by writing a line of code here, or, what I like to do when I see these little stray characters here and there, if there aren't a lot of them, is go back to where I read my data set. There you can specify which values should be considered missing, so I will include the dash as a missing value. If I run this, and then rerun everything that I ran so far, let's see if I get a plot for weather delay. I did, which means it worked. Basically, it counted the dash values as missing, and then they were filled in as zero. So now I am able to see weather delay in the histogram, and yes, the type is float, so that's also correct. I don't need to do this again, or this one, so I'll just comment them out.

So far we took a look at the missing values and the outliers, and I was keeping some extra columns to help me decide what to do with them. I don't really need those anymore, so to make life easier, I'm going to remove some of these columns. I'll again print all of the columns and choose the ones I want to go forward with. I still want year, month, day of week, airline, origin airport, destination airport and scheduled departure. I will not know departure time when the time comes to use this model, so I will remove that, along with departure delay. Scheduled arrival I will know. Arrival time I will not know, arrival delay I will not know, and canceled or cancellation reason, again, I will not know. But I will keep the delay reasons for now, because I'm going to use them to create a target variable. That's just a quick cleanup to make life easier for me.

All right, so now it's time to look at the categorical variables. Let me look at the types of my columns again. Most of them are integers and floats, which means they're numeric, but I have three that are categorical. Let's take a look at the airline one. I want to see what kinds of names there are for the airlines, and what you can use for that is value_counts: it shows you how many times each value occurred in your column. It's basically a histogram in table format. We see there are one, two, three, and so on up to fourteen different airlines, and they seem to be more or less evenly represented in the data.

What if I look at origin airport? All right, I have a lot more values here. Let's see how many different origin airports we have: 922. What about destination airport? 921. Okay, so this is a lot of distinct values, and because the column is categorical, it might overwhelm the model that I'm training at the end of the day. So for now I will opt for removing these as well from the data that I'm going to feed to my model. I'm going to try to estimate whether there's going to be a delay, and what kind of delay, using only the time, the day of the week and the airline information for the flight, and also the scheduled departure, of course.
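A sketch of the two ideas in this segment. It assumes the read-time fix was done with the `na_values` parameter of `pd.read_csv`, which matches the description but is not confirmed by the transcript, and it uses `value_counts` and `nunique` to inspect the categorical columns. The CSV content is hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV with a stray "-" in a numeric column.
csv_text = """AIRLINE,ORIGIN_AIRPORT,WEATHER_DELAY
DL,ATL,0
AA,DFW,-
DL,JFK,12
UA,ORD,3
"""

# Treat "-" as a missing value while reading, so the column parses as float.
flights = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
print(flights.dtypes)

# How often each airline occurs: a histogram in table form.
airline_counts = flights["AIRLINE"].value_counts()
print(airline_counts)

# Number of distinct categories; very high counts can make a categorical
# column expensive to encode later.
n_origins = flights["ORIGIN_AIRPORT"].nunique()
print(n_origins)
```

Note that `na_values` given as a list applies to every column, so a dash anywhere in the file would be read as missing.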

Preparing the target feature

All right, now that we've taken a look at these too, it is time to calculate our target feature. The target feature is going to be a column that says which of these delay reasons caused the actual delay, and if there is no delay, it's going to say no delay. There are a bunch of ways you can achieve that. What I'm going to do is sum all of the delay reasons to see whether there is a delay to begin with. I will create a new column called all delay, put all of the delay reason columns here, sum them up, and make sure the axis is 1 so the sum is at the row level. Then let's print the flights data set again to see if everything worked. All right, here all delay is zero; here it is 25, because all of this summed up is 25; the other ones look correct too.

You might say, Mısra, why are you not using the arrival delay, which I removed already? Remember that it is sometimes negative, and if it's lower than 15 the flight is not counted as late, so it would not give me the correct information. That's why I only use the delay reason columns here.

Next, I'm going to use NumPy, specifically the where function. You give it a condition, then the value you want returned if the condition is met, and the value you want returned if it is not. So what I'm going to say is: if flights all delay (not with the capital D) is greater than zero, so if there is a delay, I want the name of the cause of the biggest delay. We have air system delay, security delay, airline delay, late aircraft delay and weather delay. So far we've seen cases where only one of them causes a delay, but it could be that more than one does. Maybe air system delay causes a 25-minute delay but airline delay causes a thousand-minute delay; then I would want to mark airline delay as the core reason. That's why I take the maximum. But if you only use max, you get the number, and I don't want the number, I want the name of the column that holds the highest number. That's why I use idxmax, again with axis 1. This value will be returned if the all delay value is higher than zero, and if not, no delay will be returned. I will assign the result to a new column called delay reason.

Let's print our data set. All right, we see no delay here, because all the delay reasons are zero; air system delay here, because it is 25; airline delay here, where it is 156. So it looks like it worked. I just want to see how they are distributed, so: delay reason value counts. Okay, late aircraft delay is the biggest reason, sometimes there is no delay, and security delay does not happen very often, apparently.

Remember, we assumed that if there is a delay of less than 15 minutes, it is marked as no delay. Let's make sure that we also captured that assumption here. If delay reason is no delay, I want to see what the values of arrival delay are. Oh, did I remove that already? Good job. I shouldn't have, actually. So yes, I did remove it, but I fixed it now, because I want to take a look at the statistics. Maybe it will be shown more nicely in a histogram. Okay, I see that when the delay reason is no delay, arrival delay only goes up to about 10 to 15, which is perfect; that's what we've seen before.
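The target construction described above, sketched on three hypothetical flights. The column names `ALL_DELAY` and `DELAY_REASON` and the label `"NO_DELAY"` are assumptions standing in for whatever names the notebook uses.

```python
import numpy as np
import pandas as pd

delay_cols = [
    "AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY",
    "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY",
]

# Hypothetical flights: on time, one delay cause, two competing causes.
flights = pd.DataFrame({
    "AIR_SYSTEM_DELAY":    [0.0, 25.0, 25.0],
    "SECURITY_DELAY":      [0.0, 0.0, 0.0],
    "AIRLINE_DELAY":       [0.0, 0.0, 156.0],
    "LATE_AIRCRAFT_DELAY": [0.0, 0.0, 0.0],
    "WEATHER_DELAY":       [0.0, 0.0, 0.0],
})

# Row-wise total of all delay reasons (axis=1 sums across columns).
flights["ALL_DELAY"] = flights[delay_cols].sum(axis=1)

# idxmax(axis=1) returns the NAME of the column holding the largest value
# in each row; np.where keeps it only where there is a delay at all.
flights["DELAY_REASON"] = np.where(
    flights["ALL_DELAY"] > 0,
    flights[delay_cols].idxmax(axis=1),
    "NO_DELAY",
)

print(flights[["ALL_DELAY", "DELAY_REASON"]])
print(flights["DELAY_REASON"].value_counts())
```

The `np.where` guard matters because `idxmax` on a row of all zeros still returns a column name (the first one), which would otherwise mislabel on-time flights.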

Final prep for model training

All right, so now it's time to actually clean my data and not leave anything in there that I do not want, in terms of columns. I'll copy this, or maybe I need to print the columns again, because I think I changed some things. These are the columns I have right now. The columns I want are year, month, day of week, airline, scheduled departure and scheduled arrival. I don't want arrival delay, and I don't want the individual delay reasons, because I have them in delay reason. Everything else can go. So this is my data set.

Before we jump into making the model, I want to make sure my training data is ready. If you've never trained machine learning models before, this might be new to you: when you train a machine learning model, you give it the X values, which are the information you feed into it, separately from the y values, which are the information you want to get out of it. That's why we now separate our data set into X flights and y flights. X flights is going to have everything except the target value, the value we're trying to estimate, and y flights is going to have only the delay reason, but still in a data frame format. If you use two brackets, you get back a data frame; if you use one bracket, you get back a Series object, a Series being a single column or a single row. Let's see what this looks like: X flights, and then y flights.

One last thing I want to do before model creation is to create one-hot encoded columns from my categorical values. Let's look at X flights again. The only categorical column we have is the airline, and we want to one-hot encode it, because otherwise we cannot feed it to our machine learning model. Machine learning models are, at the end of the day, mathematical models, and they can only handle numerical values. We have year, month, day of the week, scheduled departure and scheduled arrival as numerical values, but I also need airline as numerical values.

What I can do for that is use the pandas get_dummies function. I pass the whole X flights data set in there and get back the one-hot encoded version of the categorical column. Let's run that and see what it looks like. All the numerical columns are the same, but the categorical column is separated into as many columns as there are values in it. If the airline is DL, let's say this is Delta, that column is one and everything else is zero. One-hot encoding makes it possible for us to feed our data into the machine learning model, which we're going to do in the next lesson.

If you've come this far, congratulations. Don't forget that you can get the code through the link somewhere here or in the description if you want to follow along. From here, we're going to jump to the next lesson, where we're going to train a machine learning model on this data using scikit-learn.
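A minimal sketch of the final preparation: splitting features from the target and one-hot encoding the airline with `pd.get_dummies`. The frame is hypothetical, and using `drop(columns=...)` to build the feature table is an assumption about how "everything except the target" is selected.

```python
import pandas as pd

# Hypothetical cleaned data set with one categorical feature and the target.
flights = pd.DataFrame({
    "MONTH": [1, 1, 2],
    "DAY_OF_WEEK": [4, 5, 1],
    "AIRLINE": ["DL", "AA", "DL"],
    "SCHEDULED_DEPARTURE": [600, 1345, 2010],
    "DELAY_REASON": ["NO_DELAY", "AIRLINE_DELAY", "WEATHER_DELAY"],
})

# Features: everything except the target column.
X_flights = flights.drop(columns=["DELAY_REASON"])

# Target: double brackets keep it a one-column DataFrame;
# single brackets would return a Series instead.
y_flights = flights[["DELAY_REASON"]]

# One-hot encode the text column; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X_flights)
print(X_encoded)
```

Each airline code becomes its own column (here `AIRLINE_AA` and `AIRLINE_DL`), holding a true/1 value only on the rows for that airline. Depending on the pandas version the new columns are boolean or integer; both are accepted as numeric input by typical model-training code.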
