Clean & Analyze Real Car Sales Data with pandas | Beginner Python Project Walkthrough
1:08:46


Dataquest · 27.02.2026 · 930 views · 37 likes

Video description
In this video, we guide you through a real-world data analysis project using pandas to explore and clean a dataset of used car listings scraped from eBay Kleinanzeigen, the German eBay classifieds site. You'll step into the role of a data analyst for a used car classifieds service, tackling the kind of messy, real-world data challenges you'd actually encounter on the job.

What You'll Learn in This Video:
✔️ How to clean messy column names and standardize datasets
✔️ How to identify and handle missing or invalid data
✔️ How to clean and convert price and mileage columns from strings to usable numbers
✔️ How to explore and filter out unrealistic registration year values
✔️ How to group and aggregate data to uncover pricing trends by brand
✔️ How to store and organize aggregate results in a new DataFrame

Whether you're building your first data portfolio project or looking to sharpen your pandas skills on real scraped data, this walkthrough gives you hands-on practice with: Python, pandas, NumPy, exploratory data analysis (EDA), data cleaning and type conversion, Boolean indexing and filtering, grouping and aggregation.

Recommended Prerequisites: Introduction to NumPy and Pandas → https://www.dataquest.io/course/pandas-fundamentals/
Access the Project: https://www.dataquest.io/

Video Chapters:
00:00:00 - Intro
00:02:21 - Project Brief & Dataset Overview
00:04:06 - Loading the Data & Encoding Issues
00:07:22 - Exploratory Data Analysis
00:21:15 - Data Cleaning: Columns, Price & Odometer
00:38:05 - Handling Outliers & Invalid Values
00:40:20 - Brand & Mileage Analysis
00:48:24 - Creating a GitHub Gist
00:50:13 - Q&A

#Python #DataScience #pandas #DataCleaning #PythonProjects #EDA #DataAnalysis #BeginnerPython

Table of contents (9 segments)

Intro

Okay, so this is the Dataquest Project Lab, and today we're going to walk through a data analysis project where we look at some car sales data from eBay and work with Python, pandas in particular. A little bit about me: my name is Anna Straw, and I'm the director of curriculum here at Dataquest. I help make sure we have lots of great content coming your way, and that our existing content stays fresh and updated with current library versions. Definitely keep an eye on your email, because we have some very exciting content coming up soon. No spoilers. Before I worked for Dataquest, I actually started my career as a high school math teacher. I did that for eight years before transitioning into data analytics, which I taught myself, and I actually used Dataquest as my platform to learn those skills. So if you're looking for a career change, know that if I can do it, you can do it too. Now, a little bit about you. This particular project is toward the beginning of our data science path, and roughly in the middle of our data analyst path, so you don't need a lot of expertise to understand what we're talking about today. You should have some familiarity with Python lists, dictionaries, loops, and conditional logic. Having done a little bit of work with pandas will make a lot of what we do more tangible, but I don't think it's an essential requirement for following along today. Today's agenda: we'll talk about the scenario for this project, explore our data, clean our data, do some analysis with some simple visualizations, potentially share the project via a GitHub gist, and end with some Q&A.

Project Brief & Dataset Overview

So, our project brief: we're going to act as data analysts for a used car classifieds service, and our job is to look at existing used car sales from eBay and see if there are any trends we can spot. Are there certain brands of cars that sell for more or less money? Does the amount of mileage on the odometer affect how much a car sells for? Those are the kinds of questions we're going to explore. In particular, we're using data from a German version of eBay, from a web-scraped dataset. So without further ado, let's dive straight into the Dataquest app; we'll be working in this JupyterLab environment. The very first thing we do for our data exploration is import our libraries. For those of you who said you have familiarity with pandas, this line will not be at all surprising: we import pandas as pd. pd is the industry-standard abbreviation for pandas. Now, if we run this (my shortcut for running cells in Jupyter is Shift+Enter on my keyboard), we can see it has completed running because we have a little number one in the side bracket here. So the very first thing we're going

Loading the Data & Encoding Issues

to do is load in our data. We'll create a pandas DataFrame called autos using pd.read_csv (pd being our abbreviation), and our CSV file is called autos.csv. I'm going to make a mistake intentionally here and show you what happens. If you want to practice along, the link for this project is being shared in the chat box, and you should be able to code alongside, or you can simply listen and absorb the information that way. There is no wrong way to participate in this webinar. If I run this right now, I'm going to get an error. Let's not panic when we see a bunch of red. And what is the error that we get? Here's some debugging for us to do: a UnicodeDecodeError. Error messages can often feel overwhelming because they're really long, and do I have to read all these lines? Not usually. The trick is to scroll all the way to the bottom, to the last line. Sometimes you need to go back a little further up in the error to see what's happening, but the last line is always the best hint. It says: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1059: invalid continuation byte. This can sound like a bit of technical mumbo jumbo, but the key here is UTF-8. When we read in files, there are different flavors of ways files can get read in, and UTF-8 is the common encoding method for most files. But occasionally there are certain characters, maybe a non-English character, maybe a funky emoji, that UTF-8 doesn't know about. It takes all those zeros and ones from the back end of the computer and tries to turn them into readable text, but it hits a character it doesn't understand. So, there are other encoding methods, and let's try the second most common one, which we specify with encoding="latin-1".
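The fallback she describes can be sketched as a small helper. The function name is our own, and the demo file is synthetic, built to contain the same 0xDC byte the error message complains about:

```python
import pandas as pd

def load_csv_with_fallback(path):
    """Try UTF-8 first; fall back to Latin-1 if decoding fails."""
    try:
        return pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value 0-255 to a character, so it
        # never raises a decode error (characters may still be misread).
        return pd.read_csv(path, encoding="latin-1")

# Demo: a tiny CSV containing byte 0xDC ("Ü" in Latin-1), invalid as UTF-8.
with open("demo_latin1.csv", "wb") as f:
    f.write(b"name,price\nT\xdcV-geprueft,500\n")

autos = load_csv_with_fallback("demo_latin1.csv")
print(autos.loc[0, "name"])  # TÜV-geprueft
```

In the notebook itself you would simply call pd.read_csv("autos.csv", encoding="latin-1"), as in the walkthrough; the try/except wrapper is just one way to make the fallback automatic.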
And my tip here: if you ever get a UnicodeDecodeError and UTF-8 isn't working, give Latin-1 a try. I would say nine times out of ten it will work. So let's see if it works in this situation. I'm going to press Shift+Enter to run the cell. And we can see it runs; we don't get an error, because we have a number in our bracket here. So, Latin-1 solved our encoding issue. Now, in the same cell, now that we've debugged, I'm going to

Exploratory Data Analysis

display some information about our data. We're beginning the data exploration phase, because we can't do any analysis until we understand what our data is actually giving us. The info() method is a great first glance at how much data you're dealing with, what the data types and column names are, all of that. And the head() method displays the first five rows of the actual data, so we can see firsthand what those values look like. So in our info() output, that's what we see here: we have 50,000 rows of data and 20 columns. When I look at info(), there are a couple of things in particular I check. First, how many non-null values we have. Null values usually represent missing values, and if you have a lot of missing data, your analysis can be less accurate and less powerful. Looking here: 50,000 out of 50,000. Great, no null values for most columns. There are a couple of null values in the vehicle type column, and it looks like our largest count of nulls is in the not-repaired-damage column. So we might want to keep an eye on that one, in the back of our heads, as a potential data cleaning step we would need to consider. But for today's analysis we're actually not using that column, so it's okay. The other thing I like to look at is the data type, the Dtype column. object generally means it's a string. So anywhere we see object, the name of a car, for example, is going to be a string; the name of the seller is going to be a string. int64 means it's an integer, and that makes sense for the year of registration. Same thing for power PS: that's the amount of power the car has, and the unit here is PS. The month of registration is an integer as well, and so is the number of pictures associated with the eBay listing.
Postal code is an integer too, but this is a little bit of a gotcha: does an average postal code mean anything? If you take the sum of all postal codes, does that make sense? No. So for our analysis, we don't need to worry about this, but if you were working toward a machine learning model, for example, you'd want to be very mindful that this shouldn't actually be treated as a number. A couple of other column identifications for the good of our webinar today. The price is how much the car sold for on eBay, and this is interesting: right now, the price is listed as an object, but price should be a number. So when we look at the first five rows of the data, I want to see why the price is being labeled as an object. And what else do we have here? The odometer. This is the other column we're going to focus our analysis on today, and the odometer is also listed as an object. The distance on the car should be a number, so the fact that it's an object is another odd one. A couple of other things, just to understand the dataset: this data was web scraped from eBay, and several of these columns give us information about that scraping process. date crawled is the date a bot crawled eBay to pull the information about a car. Very similarly, last seen, our final column here, is the time the bot last saw this particular eBay page. All right, so now let's go ahead and look at the first five rows of our dataset. Our date crawled looks like a date. We have some long string-type names for the name of the car. The seller looks like it's very often listed as private, and remember, this is German eBay, so the words may be in German. The offer type is Angebot. I don't know what that means; I don't speak German. And then the price. Remember, this was an object type.
And now I'm seeing why it's an object type: there's a dollar symbol and a comma to denote the thousands place. So we have our very first data cleaning task queued up. We are going to remove the dollar symbol as well as the comma and convert this into an integer, so we can do some descriptive statistics with it. All right, gearbox says manual or automatic. Edu, thank you: Angebot means offer in English. Okay, so when we see the word Angebot here, it's just saying there's some kind of offer available, maybe offer versus set price or something. And yes, Nashri (hopefully I'm saying your name correctly), you are correct: the odometer, which we noted was an object data type, has this km marking as well as the comma, so this is another one where we'll need to remove some string characters to convert it into an integer. Okay, and the rest of this is looking okay. For our analysis, we're not going to touch most of these columns very much, although this is a great dataset, and if you want to move on to next steps after the webinar, I do have some ideas where some of these columns might be useful for you. Okay. The very next thing we're going to do is just a good practice. If we look at the names of our columns, for example vehicleType: this is using something called camel case, where multiple words for a single name are mushed together without a space and new words are denoted with a capital letter. It's called camel case because the capital letters create little bumps that show us the new words. But in Python, the best practice is to use snake case, where separate words are separated by an underscore. So, to keep in line with best practices, we are going to change our column names.
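The walkthrough edits the column list by hand, but the same conversion can be done programmatically. This is a minimal sketch on a few toy camel-case names (the real dataset has 20 columns), using a regex helper of our own:

```python
import re
import pandas as pd

# Toy frame with camelCase columns mimicking the scraped dataset's style.
autos = pd.DataFrame({
    "dateCrawled": ["2016-03-26"],
    "vehicleType": ["bus"],
    "yearOfRegistration": [2004],
})

def camel_to_snake(name):
    # Insert an underscore before each interior capital, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

autos.columns = [camel_to_snake(c) for c in autos.columns]
print(list(autos.columns))
# ['date_crawled', 'vehicle_type', 'year_of_registration']
```

Hand-editing the copied list, as she does next, has the advantage that you can also shorten or correct names while you're at it; the regex approach only mechanically converts the case.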
Now, the first thing we're going to do to help us with this is use the columns attribute to list all of our columns, which gives us an easy, copy-and-pasteable way to edit each individual column name. Then we take those columns. I'm doing this part for us, but I encourage you to copy and paste from the columns list and make the changes yourself. So instead of dateCrawled, as it was originally in the dataset, replace that capital C with an underscore and a lowercase c, so that all of the columns end up in snake case. And it is important to keep the column names in the exact same order, because pandas isn't going to know if you rearrange them; it will just pair the wrong name with the wrong column. So we keep all the column names in the same order and assign the list back to the columns attribute. And when we print the first five rows of our dataset, we can now see those column names in snake case instead. So that's just a little bit of best practice for us. Now, continuing our data exploration, I really like looking at the describe() method. The describe() method gives us summary statistics, and because so many of our columns are currently strings rather than numbers, I'm using the parameter include="all". This includes every column in our dataset, whether it's a number type or not. If we don't include this... actually, let's see what happens if we don't include this. I'm a big fan of experimenting. And we see that we're missing most of the columns, because these are the only numeric columns we can calculate mean, median, and standard deviation on. But when we include all, now we see everything. If you're new to this, you'll see a bunch of NaN values, which stands for "not a number". This simply means these are string columns and describe() can't calculate a mean for them. It's not an error; we just don't have that information available.
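The include="all" behavior can be sketched on a toy frame standing in for the autos data (column names match the walkthrough; values are illustrative):

```python
import pandas as pd

# Toy stand-in for the autos data; the real file isn't loaded here.
autos = pd.DataFrame({
    "price": ["$5,000", "$8,990", "$4,350"],
    "gearbox": ["manuell", "automatik", "manuell"],
    "odometer": ["150,000km", "150,000km", "125,000km"],
    "registration_year": [2004, 2011, 2008],
})

# describe() alone summarizes only the numeric column; include="all"
# adds the object columns, with NaN where a statistic doesn't apply.
numeric_only = autos.describe()
summary = autos.describe(include="all")

print(list(numeric_only.columns))  # ['registration_year']
print(summary.loc["unique"])       # unique counts; NaN for the numeric column
```

The "unique" row printed at the end is exactly the row she scans next to spot low-cardinality columns like seller and gearbox.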
So whenever I look at describe() output for object columns, I like checking the number of unique values. For example, for date crawled, it makes sense that there are a lot of unique values; similarly for the name of the car. But the columns with just a couple of unique values are ones I might want to keep in mind, because maybe they're not very valuable if there are only two different values. So seller has only two different values. The ab test column makes sense with two values, because it's either going to be test or control. Gearbox makes sense because it's going to be automatic or manual. This one's a little interesting to me, though: the odometer, which is still listed as a string data type, has only 13 unique values, which for something that's supposed to be numeric is a little curious. So we'll just keep an eye on that. And then scrolling through our other unique values, the number of photos and postal code columns are standing out to me. For number of photos in particular, all of our descriptive statistics are zero. So are there any nonzero values here? Something we're going to explore. All right. And now looking through our numeric data types, let's first look at our mins and our maxes; this very easily helps us spot outliers. For registration year, the year the car was registered, the minimum value is 1000. Now, I'm not a car expert, but I'm pretty sure cars didn't exist in the year 1000, so we're going to want to explore that. And similarly, the maximum of 9999 is way in the future, so this value is likely a data quality error that we need to correct. So, what else do we have? Power.
Maybe this maximum value is a little off, but we're not going to include this column in our analysis today, so I'm not concerned about it. If you do want to do further analysis on this column in particular, I would investigate that. Registration month makes sense to me, and we don't have many other numeric values, so that's it for now. So let's start picking out those issues we wanted

Data Cleaning: Columns, Price & Odometer

to investigate further. The first one is that number of photos column, where everything looked like a zero. I want to see the value counts: is there anything that isn't zero? If you are newer to pandas, what we're doing here is taking our autos DataFrame and saying we just want to look at this one column; this bracket notation is a very nice way to pull out individual columns. Then we call the built-in value_counts() method. And we can see in the output that all 50,000 rows of number of photos are zero. So this is a useless column for our purposes, and we can drop it completely. Some others we're going to remove are the seller, because there were only two unique values, and the offer type, because it just isn't giving us any information we need. To drop columns from a DataFrame, we say our DataFrame is now going to be our existing DataFrame with the columns in this list dropped. And to specify that we are dropping columns and not rows, we use the axis= parameter: axis=1 means we're dropping columns, and axis=0 means we're dropping rows. A quick clarification about DataFrame versus dataset. In pandas, you are going to be working with DataFrames. That is the way pandas pulls in and most easily parses tabular-style data. So if you've ever worked in Excel with columns and rows there, pandas doesn't have Excel, so it creates a DataFrame to make columns and rows happen. When I say DataFrame, you can visualize that tabular data. And when I say dataset, it's more or less a synonym, but DataFrame is a little more technically accurate when we're working in the code environment, because this is a DataFrame based on a dataset. Hopefully that makes sense. So we're going to run this and drop our columns really quickly.
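That drop step might look like this (toy frame with the same column names; the values are illustrative, not from the real file):

```python
import pandas as pd

autos = pd.DataFrame({
    "seller": ["privat", "privat", "privat"],
    "offer_type": ["Angebot", "Angebot", "Angebot"],
    "nr_of_pictures": [0, 0, 0],
    "price": ["$5,000", "$8,990", "$4,350"],
})

# Confirm the pictures column really is all zeros before dropping it.
print(autos["nr_of_pictures"].value_counts())

# axis=1 drops columns; axis=0 would drop rows by index label.
autos = autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1)
print(list(autos.columns))  # ['price']
```

Reassigning the result back to autos, rather than using inplace=True, keeps the operation explicit and is generally the recommended style.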
And now we're going to take care of the price and odometer columns, which should be numeric but are currently strings. I'm going to copy a chunk of code and walk through it. This might look a little intimidating, but what we're doing is reassigning some work to our price column. The work starts from the existing price column, which we first ensure is a string; it's important that we're operating on strings here. Then we use the replace method, where the first argument says what we're replacing, the dollar sign, and the second argument is what we're replacing it with: nothing. So this is the way to remove characters from a value. And we do exactly the same thing to replace our comma with nothing. The final thing is to change the type, because it's currently a string, and we now say the type will be an integer. This is called method chaining, and presenting it this way makes it very readable, going line by line. It does the same thing if we delete all the spaces and put it on a single line, and I want to show you that because I know that when I was new to Python, method chaining felt kind of intimidating. What it's doing is giving us an easier way to do several things at once. The chaining comes from the dots: we do this dot, then this dot, so one line of code can do three things for us. It replaces two different characters and converts the result to an integer. And Dave makes a good point: many European countries use a comma for the decimal point.
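The chained cleaning she walks through, sketched on toy values (the column contents are illustrative, and the odometer rename from the next step is included so both columns end up cleaned together):

```python
import pandas as pd

autos = pd.DataFrame({
    "price": ["$5,000", "$8,990", "$4,350"],
    "odometer": ["150,000km", "150,000km", "125,000km"],
})

# Method chaining: each .str call returns a new Series, so the two
# replaces and the final astype read top to bottom.
autos["price"] = (autos["price"]
                  .astype(str)
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))

autos["odometer"] = (autos["odometer"]
                     .astype(str)
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int))

# Rename so the unit isn't lost once the "km" text is stripped out.
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
print(autos.dtypes)
```

regex=False is worth passing explicitly here: "$" is a special character in regular expressions, and treating the pattern as a literal string avoids that pitfall.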
And this is why it's really important that we looked at our head() output earlier today: we were able to correctly pinpoint the exact characters that were causing our price to be a string instead of a number. If we didn't take this peek under the hood, we might pull the wrong characters out. Okay. So now when we look at the first couple of values of price using the head() method, we see that they are in fact clean numbers and the data type is an integer. Now we're going to do the same thing for odometer, but I'm going to give you a second to check your understanding: what do we need to do to odometer to turn it from a string into an integer? There are two things to take care of. In chat, I see it: remove the km unit and the comma. Yes, we're going to use the replace method, exactly. It's almost the exact same code; we could copy and paste from here and simply change the specific characters. We replace km with nothing, replace the comma with nothing, and then convert to an integer. And to keep track of the fact that odometer is in kilometers, since we're removing that information from the values themselves, we're going to rename the column: instead of odometer, we'll call it odometer_km. axis is 1; remember, that's for our columns, not our rows. And inplace=True tells pandas to modify the existing DataFrame directly rather than returning a new one: just take this new name and drop it in. And now when we look at the head of our odometer_km column, we can see that yes, everything is a nice number with an integer type. All right. What we want to do now is get prepared for our analysis. We've cleaned what we need to clean for the time being, because we're going to be exploring price and odometer specifically. But there was one extra thing about odometer that we wanted to explore from our data exploration phase.
And as a reminder, that was the fact that odometer had only 13 unique values. If it's a number, why are there only 13 unique values? Odometer readings can be vastly different for different cars. So, let's look at the value counts. We've seen this before: we use our autos DataFrame, pull out the odometer column specifically, and call value_counts(). And we can see that yes, we have 13 different values covering all of our data. My best guess as to what's happening here is that when you listed a car on eBay for sale, at least in the German market in 2016 when this data is from, you had to pick from a dropdown of options. This is fine for our purposes. The difference between a car in the 125,000 km bucket and the 150,000 km bucket doesn't lose an amount of granularity that's essential. So as far as data cleaning goes, we're not worried about that, but it is good to verify what exactly was happening with that column. Okay. And now that our two analysis columns are in numeric form, we're going to revisit the describe() method. I'm going to look at the price column first and its descriptive statistics, and then at the value counts for price, but only the first few, because we have lots and lots of unique prices. Just to start getting an idea of what's happening here. All right. First of all, our descriptive statistics are showing up in scientific notation, which is slightly annoying, but we can still get what we need from it. The first thing I'm seeing is the minimum price. We don't need to be experts in scientific notation to see that the minimum price is zero, and a car being sold for $0 does not make sense. The maximum price, however, is a one with eight zeros after it, which is a large amount of money for a car.
And I think we're going to need to look at our maximum values to see if these are outliers, because I cannot imagine a world where a car is being sold for that amount of money. So now, looking at the first few values in the value counts for price, we see that about 1,400 listings have a price of zero. Now, if we remember, our dataset has 50,000 rows, so this is a pretty small number of instances in the grand scheme of things, and I think it makes sense to drop the cars listed as sold for $0. Maybe it means the car didn't actually sell; I don't know exactly what it means, but for our purposes as used car sales analysts, this isn't what we're doing our analysis for, so we're going to remove them. And now let's look at the big prices, those potential outliers. The way we're doing this is to look at our value counts, but sort the index in descending order. We specify descending order by saying ascending=False, and look at the top 20 values here. Okay. And we can see some astronomical prices: everything here is a car being sold for millions. For these top values, I'm going to make the executive decision to remove them. This is where the art of data cleaning comes in. Maybe you have a different threshold for what you want to keep and remove, but for me, if a car is being sold for over a million dollars, it doesn't have a place in our analysis. Margaret has a great clarifying question: is it dollars? Since it's a German eBay site, it's probably in German currency. But even so, given that our dataset has 50,000 rows and only about 10 or 15 values are over a million, even though I don't know the exact exchange rate between dollars and euros, these values probably don't help our analysis at all.
And they could be luxury cars, but again, out of 50,000 rows of data, only about 15 are in this particular range, so the judgment call is that we don't need them. But maybe you want to explore those top cars: are they luxury cars? Maybe there's an analysis there you could do. Deutsche Mark, thank you; my American geography is failing me. All right. So now let's do the same thing for the value counts of our lowest prices, because we did have some small values here. Same thing, but with ascending=True. Okay. And we can see several prices that are very low. A price of $0 we saw before, about 1,400 instances of a car listed for no money, and then there are a handful of cars listed at very, very low prices. So we're going to remove our outliers, and here's the judgment call I made: we're only going to keep cars with prices between $1 and $350,000. This removes our extreme high outliers as well as our zeros, because it does seem like there are a handful of cars genuinely being sold for a dollar. I don't know the story there, but I don't think we can completely remove them. And now that we've removed those extremes, let's revisit the describe() method. Because we got rid of those very large values, we're no longer in scientific notation, which is nice for human readability. And we can see that yes, our maximum value is within our range, our minimum value is now one, and we haven't lost an extraordinary amount of data: we still have about 48,500 rows of our original 50,000. All right, if you're following along with my solution notebook, I'm going to skip the next portion for time, so we can get to the analysis part everyone was most excited about. Yeah, all of the date crawled information we're going to skip.
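The $1 to $350,000 filter can be written with Series.between, shown here on a toy price column:

```python
import pandas as pd

autos = pd.DataFrame({"price": [0, 1, 500, 12_000, 350_000, 99_999_999]})

# between() is inclusive on both ends by default, so the $1 cars and
# the $350,000 cars survive while $0 and the extreme listings drop out.
autos = autos[autos["price"].between(1, 350_000)]
print(autos["price"].tolist())  # [1, 500, 12000, 350000]
```

An equivalent spelling is the explicit Boolean mask, autos[(autos["price"] >= 1) & (autos["price"] <= 350_000)]; between() just reads more cleanly.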

Handling Outliers & Invalid Values

Last seen we're also going to skip. But registration year we are going to take care of, because, remembering our exploration, the year the car was registered had a minimum value of 1000 and a maximum value of 9999. These years don't make sense for cars, so we need to take care of them. And we're making a judgment call here. Once again, I'm not an expert on cars, but we are going to say that a valid registration year for our purposes is anything between 1900 and 2016. And 2016 was picked here because the dataset ends in 2016. So, this line is going to help us see how much of the data is not in this range, because if a large amount falls outside it, we would need to consider whether we need to keep it. But we can see that we only have about 4% of the data outside this range, so it's okay. What does the tilde symbol do? It's the "not" symbol: which cars are not in our valid range? And about 4% of cars are not in the valid range, which for dropping is okay; that's a small enough number of rows that we can remove them just fine. So let's go ahead and remove them. And we can see that we are reassigning our DataFrame now. Whereas in the row before we were just doing a quick calculation, now we're actually saying we only include registration years between 1900 and 2016. And just a quick peek at the value counts for registration year to see if it makes sense. Okay, yeah, this is looking fine. The top registration year, 2000, has about 7% of our data, 2005 has 6%, and all of these make sense for years of cars being sold as used.
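The tilde trick and the year filter together, on a toy column (40% of rows are invalid here only because the toy frame is tiny; the real dataset had about 4%):

```python
import pandas as pd

autos = pd.DataFrame({"registration_year": [1000, 1999, 2005, 2016, 9999]})

valid = autos["registration_year"].between(1900, 2016)

# ~ negates the Boolean mask, and the mean of a Boolean Series is the
# share of True values: here, the share of rows OUTSIDE the valid range.
share_invalid = (~valid).mean()
print(round(share_invalid, 2))  # 0.4 in this toy frame

autos = autos[valid]
print(autos["registration_year"].tolist())  # [1999, 2005, 2016]
```

Taking .mean() of a mask is a handy idiom for "what fraction of rows would this filter remove" before you commit to removing them.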

Brand & Mileage Analysis

Okay, so now we are going to begin our analysis. The first thing we're going to do is look at which brands are most popular in the German eBay market. We're looking at our value counts with normalize, which basically turns the counts into percentages of the data rather than raw numbers. And so among the most popular brands, we can see Volkswagen has about 21% of the sales, BMW has 11%, Opel has 10%, and they decrease from there. Now, there are some really, really small values, less than a percent, with Volvo, Mini, and so on and so forth. You could keep all of these for your analysis, but we're going to keep just the most common brands, to help the used car company we're acting as analysts for get the most information for actual decision-making. So we're going to define common brands as those with more than 5% of the total instances. And we can see the list has shrunk drastically: we have Volkswagen, BMW, Opel, Mercedes-Benz, Audi, and Ford. All right. And now what we're going to do is find the mean price for each of our common brands. We create an empty dictionary, and for every brand in our list of common brands, we pull out the brand name, then the prices for that brand, and we use the mean method to get the average price. Then we convert it back into an integer just to tidy it up. Let's take a peek at what this looks like. All right. So we can see that Volkswagen has a mean price of about 5,400, BMW about 8,300, and it looks like the cheapest one here is Opel, at around 2,900. So this is kind of cool, but the final thing I want to do here is actually turn this into a very quick visual. And to do this, we're going to convert this information, which is currently a dictionary, into a DataFrame. Remember back to our conversation about dataset versus DataFrame.
DataFrames are the easiest way in pandas to run calculations and put together a quick visual, because the DataFrame is really the core of pandas. To turn the dictionary into a DataFrame, we pull out a Series of the prices; the Series index becomes our rows, and our single column is the mean price. And now look, we've turned this dictionary into a very pretty-looking DataFrame. But as a bonus, we're going to visualize it. Hopefully I do this correctly. We assign the DataFrame to a variable name and then call .plot on it. If we plot it as-is, we get a line graph, which doesn't really make sense for our data, so we specify that the kind is a bar graph. And with one single line of code, without needing any additional Python libraries, we get a pretty nice visualization showing the different prices. If you are preparing data for a stakeholder, I always recommend including a visual when you can.

Okay, now we're going to do a very similar thing for mileage. Again we create an empty dictionary, loop through the brands in our common brands, and find the average odometer reading for the cars sold under each brand. Then we turn that into a DataFrame and display it. Looking just at the raw numbers, it seems like mean mileage does not vary as much as price did. Price varied quite a bit, with Audi, Mercedes-Benz, and BMW at higher price points, Volkswagen in the middle, and Opel and Ford a little lower, whereas for mileage everything is about the same. We can do a very quick visualization again.
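The dictionary-to-DataFrame conversion and the combined table described here can be sketched as follows. The aggregate numbers are illustrative stand-ins, and `brand_info` / `mean_mileage` are assumed names rather than the exact webinar code:

```python
import pandas as pd

# Illustrative stand-ins for the aggregates computed from the full dataset
mean_prices = {"volkswagen": 5400, "bmw": 8300, "opel": 2900}
mean_mileage = {"volkswagen": 128700, "bmw": 132600, "opel": 129300}

# A Series built from a dict uses the keys as its index (our row labels)
price_series = pd.Series(mean_prices)

# to_frame turns the Series into a one-column DataFrame
brand_info = price_series.to_frame(name="mean_price")

# Adding the mileage aggregates as a second column; pandas aligns on the index
brand_info["mean_mileage"] = pd.Series(mean_mileage)

# One extra line gives a quick bar chart (requires matplotlib to render):
# brand_info["mean_price"].plot(kind="bar")
```

Because both Series share the same brand index, pandas lines the two columns up automatically, which is what makes the final side-by-side snapshot so cheap to build.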
So, let's call .plot on our brand info, and we want a bar chart. And yes, we can see everything is about the same. So let's give our stakeholders a nice little snapshot at the end, where we put everything together into one small DataFrame: we create a new column in our brand info called mean price and assign it the values from our mean prices. Now we can see everything next to each other. We can say that, in general, if we want to sell higher-value cars, BMW, Mercedes-Benz, and Audi are those cars; but if we want to hit the sweet spot, Volkswagen might be a good contender, because it's not the high end, but it's also not the low end like Opel and Ford.

This is where our analysis ends today, but this dataset is so rich that there are many more opportunities with it, so I have a couple of next steps. First, what about common brand and model combinations? We looked just at brand, but we do have information about the model as well. Does the price vary drastically for common brands depending on the model? Second, what if we split the odometer readings into groups, since we did have those 13 distinct groups, and then see whether average prices vary drastically based on the odometer reading? And another column we didn't use today, but that could be interesting to evaluate, is damage: there's a column that says whether or not the car is damaged. Are damaged cars sold for much less money? Some interesting ideas there. I thought we wouldn't have time to work on the GitHub gist, but we do, so let's do it really quickly. So, coming into your

Creating a GitHub Gist

Jupyter notebook, my favorite way to do this is from the side panel, which you can collapse or expand with the little folder symbol. If you right-click your notebook and choose "Open With > Editor," it will show you the raw code for everything you've written. Then go to gist.github.com, and you'll get a page that looks like this. You'll need a GitHub account and to be signed in. Then give your gist a description; I'll say "project lab demo for eBay cars." Next, give it a file name including the extension. It's very important to include the extension so that GitHub knows how to render your file, so I'm going to type "ebay" dot, and because we worked in Jupyter, the extension is .ipynb. Then paste in the raw code we just copied. You can click "Create secret gist" if you want it kept out of public listings, or create a public gist that anyone can see. And drum roll, please: our entire Jupyter notebook is correctly rendered on GitHub. This is something you can share with other people, and if you scroll down to the bottom, there's an area to leave comments, so you can ask people for feedback. I love this; it's a very nice, clean, simple way to share Jupyter code in particular. Okay, and now we have some time for

Q&A

questions. I know we're nearing the end of the official webinar time, so if you do need to hop off, I completely understand, and I want to thank you for joining today. I always love talking through data projects, but I will stick around for about 12 more minutes to answer some questions. If you do have questions for me, make sure you're asking them in the Q&A box.

Richard asks, "What is the best Python IDE to use for data analysis?" Best is subjective. I personally use VS Code: it's free, it's commonly used and understood, the documentation is good, and it's very flexible. My colleague uses PyCharm and really likes it, so I think those are the two top contenders. You can also consider Google Colab if you want to keep everything completely online. I know that when I was transitioning from learning in an online app environment like Dataquest to working locally, there was a bit of a learning curve, and if you're finding yourself asking that type of question, I highly encourage you to explore an IDE. The interface might feel overwhelming at first, but it's essential to familiarize yourself with it and start building that fluency.

Jennifer asks, "Do we have to import it as pd? Can you leave it as pandas?" This refers to our import statement, and yes, you can. For example, if I delete the "as pd" part and rerun that cell, then further down we'd have to write out pandas in full. So you can do it, but it's more cumbersome to write out the entire word pandas every time, especially once you start method chaining like we did later on, so the industry standard really is to use the abbreviation. You'll see this for many common data analysis libraries: NumPy is np, and Seaborn, a visualization library, is sns (I don't know why it picked sns). And then another one:
This isn't the exact syntax, but matplotlib, the plotting library, is plt. Familiarizing yourself with these very common abbreviations will help you understand other people's code more easily and help other people understand your code more easily.

Adu asks how to install pandas. Can you ask that question in the Q&A box, please? I'll try to get to it there.

Ranji asks about the Non-Null Count column, which I think is in the .info() output, and why the values seem to come in reverse order. Non-null means a value is populated with information, so the date crawled column has 50,000 rows populated with information; its null count would be zero. It's showing us how much information is there versus not there. Hopefully that answers your question; if not, let me know.

Bode asks about the UnicodeDecodeError. This is about encoding, and when I've done webinars that touch on encoding before, this is a pretty common sticking point. I think the best way to think of it is as translation: if we're translating from English to Spanish, we need to know we're starting in English in order to get to Spanish. But if the words we have are actually in German, and we say "take what we have and turn it into Spanish" as if it were English, it doesn't know how to do that. So with encoding, if the file is not UTF-8, which is essentially the standard, we need to specify what we are starting with. That's why we include the encoding argument; if you don't include it, you will get an error, and adding it in should solve the problem.

Edu asks how to make observations easily. This comes with experience, and I think that's why doing projects as you learn is so important: it's how you learn the process of exploring your data.
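The encoding fallback from the UnicodeDecodeError answer above can be sketched like this. The file name and its two rows are invented purely for illustration, so the failure and the fix can both be seen end to end:

```python
import pandas as pd

# Create a small Latin-1 encoded CSV as a stand-in for the scraped dataset
with open("autos_demo.csv", "w", encoding="latin-1") as f:
    f.write("name,price\nKäfer,1500\nGolf,5400\n")

try:
    # pandas assumes UTF-8 by default; these bytes are not valid UTF-8
    autos = pd.read_csv("autos_demo.csv")
    needed_fallback = False
except UnicodeDecodeError:
    # Tell pandas what encoding we are starting from
    autos = pd.read_csv("autos_demo.csv", encoding="Latin-1")
    needed_fallback = True
```

The default read fails on the non-ASCII byte in "Käfer," and the second call succeeds once the source encoding is spelled out.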
Learning how to see what's important in your data is not something I knew from the start; it's something I developed over many, many projects, first as a learner and then as a teacher going through these project labs. But if I were you, here's the small checklist of things to observe that I'd write down: look at non-null values, look at data types, look at unique values for string columns, and look at min and max values for numeric columns. With just that small checklist, you already have a great starting point for figuring out where your good data versus your bad data is.

Muhammad asks, will there be a model-building session on this analyzed data? That's an interesting idea. Would you mind submitting that as an idea in the webinar feedback? I'm always looking for new webinar ideas.

Allison asks, "What is the benefit of a gist versus GitHub proper?" How fast it is. I like gists for just putting some code online in a rendered format, because with GitHub proper you have to create a repository, create a branch for your new code, push, and open a pull request; the workflow is a little longer, and if you're new to GitHub, all of that can feel intimidating and overwhelming. With a gist, you don't need to know any of that: just copy your code in and voila. So when I'm writing lessons and so on, I use GitHub proper, but when I just need a very quick piece of code to share with a friend, GitHub gist.

Charles asks, do you put it in GitHub only at the end? I think it's easiest to put it into GitHub at the end, but you can do versions. For example, in this quick demo we just made, if we click Edit and paste in a new version (say, one where we don't import pandas as pd, which isn't best practice; I'm just doing a very quick demonstration here)
and then update the gist, we can see it's now rendering our new update, and the Revisions tab shows us exactly what was changed. So you can do multiple versions in a GitHub gist, but I think it's easiest to wait until you have a good amount of code before putting a gist up, because updating does add a couple of steps.

And Adu asks about pip install pandas. Yes, pip install can be a little bit of a hurdle. If you're working in the Dataquest environment, you won't need to deal with it, but if you're working locally, in VS Code for example, you need to open a terminal. On Linux or Mac, the default shell is usually bash or zsh, and pip works in both. Now, this isn't going to actually run right now, I'm just showing the typing, but: pip install pandas. When you press Enter in your terminal, you should see a bunch of lines rush by as it installs, and at the end it will confirm that the library was installed. Once you have that, you'll have pandas on your computer. You can do this for any other library; for example, if you want to install the Seaborn visualization library, this is how you would do it in your terminal. My command-line skills are mid at best, but if you want to install multiple libraries at once, you can list them space-separated, like this. If you do run into errors with this (I know I did on my Windows machine), that goes a little beyond the scope of today's webinar. I will say, though, that AI is pretty decent at helping you debug terminal and command-line errors. So if you copy any error you get from a terminal into an AI tool and say, "Hey, I'm trying to pip install this and here's what I'm getting," you can iterate through it like that, and it might help you solve the problem.

And Tree Lock says that an exclamation point, !pip install, is what you need here. Yes, although I'm not sure the Dataquest environment is set up for that; this might not work.
Okay, yeah, you're not seeing the libraries actually listed because the Dataquest environment has a layer in between, but !pip would technically work. We try to ensure that our JupyterLab environment is set up with any library you could possibly want for a data analysis project, because the pip side of things just gets a little more complex there.

All right, we are at time, and we have so many more questions. Thank you, everyone who's staying after; I will stay four more minutes, and then I do have to hop off.

Tai asks, "Do we have any tips or guidelines to use AI as a tool and not as a crutch?" Yes. The first time you're writing any type of code, try to write it by hand. Especially as you're learning the basics of functions, classes, methods, and those key core concepts of programming, it's going to unlock layers of your own understanding if you can type it out yourself. And if you ever see AI spit some code back at you and you don't understand any of it, that's a good indicator that maybe you need to put it aside for now and focus on trying it yourself first. I do think AI is great for helping you debug: if you get a traceback error, one of those big walls of "everything is bad," AI is great. You copy that traceback into AI, ask what you need to do to fix it, and it can point you in the right direction. I remember I started learning how to program before AI was a thing, and I would spend hours trying to debug a single line of code. Maybe that was valuable, but it was also a big slowdown of my learning time, so speeding that process up is really nice. Another good use: if you know there's a function or method name you've learned, but you just cannot remember the exact syntax, AI is a great tool for that.
For example, during this webinar I blanked on how to specify the type of chart. Hopefully my internet is connected; I'm having a little bit of difficulty here. Okay, I think we're okay. So I can ask AI: here's this line of code, how do I specify a bar chart? And it will come back with the kind argument.

Ajith asks, "Some of them are objects. How do we check the object type?" I don't know the exact reason why pandas says object here instead of string; it gets into classes and all of that, but the way I think of object is that it's basically anything that isn't a number. And if you look a little more at your data and realize a column is strings, fine, it gets to stay an object data type; not a big deal. Dates are also often initially parsed as object, but there is a better data type for dates in pandas, an actual datetime data type. So, how do you check the type? Looking at .info() is the quickest way, but there's also the option of checking a single column; let's take autos and a random column name. Is it .type? This is potentially a good use of AI, so let's do some live AI checking: how do I see the data type? My first guess wasn't right, and we can see .dtype is what I missed, and I don't need the parentheses because it's an attribute, not a function. Then we can see the dtype listed as O, which stands for object. So that's another way you can do it.

All right, we are substantially past time. Thank you all for sticking around for some extra Q&A. If I didn't get to answer your question, please do reach out to me in the Dataquest community: if you make a post in this course and tag me, @anna_strol, I would love to keep the conversation going there. And if you have a few moments, it would be wonderful if you could fill out the feedback survey.
Let me know how you enjoyed today's session, and if you have any ideas for future webinar topics; I always read through those to help inform what the next webinar will be about. Thank you all so much for joining. I hope you have a phenomenal rest of your day, and I'll catch you all next time. Bye, everyone!
