Solving Real-World Data Science Problems with Python! (Predicting Healthcare Insurance Costs)



Table of contents (14 segments)

Segment 1 (00:00 - 05:00)

Hey, what's up everyone, and welcome back to another video. This is definitely a requested topic: we're going to be solving a real-world regression modeling problem in Python. The basic premise of this challenge is: given an individual, their body mass index, the number of dependents they have, whether or not they're a smoker, where they live, etc., can we build a regression model to predict what their insurance costs should be? Being able to build regression models like this one, and knowing whether to use a linear regression, a polynomial regression, or some other type of regression, is a super valuable skill in the world of data science, so I'm excited to dive in.

Real quick, I want to do a refresher on what regression is. Imagine you have a scatter plot, let's say age versus insurance costs in this example. Basically, we want to figure out the relationship with age and see if we can use it to predict the insurance costs. We can take all of these points and fit a line to them, and that line is our model; that is our linear regression model. If we have a new age as input, we can use this line to predict what we expect the insurance cost to be, and our R² value tells us how well that line does at predicting the actual costs. That's a very simple linear example. In this project we're using more variables: not just age, but whether they smoke, where they're from, their BMI, etc. You're basically taking all of those points, mapping them into a five-, six-, or N-dimensional space, and fitting a line to that. These models don't necessarily need to be linear, either. Imagine an investment portfolio: you might fit an exponential regression line to it and then use that to figure out how much the portfolio will be worth, say, twenty years down the road. The process of figuring out the line that best fits the data is what we mean when we say regression.

We'll be working off of a DataCamp project for this task. You can find it by going to DataCamp (use the link at the top of the description to get there), going into Learn, then Real-World Projects. I usually filter by Python, and the one we'll be doing is "From Data to Dollars: Predicting Insurance Charges". This is a premium project, but if you use the link at the top of the description you can get 25% off an annual DataCamp subscription, which gets you access to all of these projects as well as tons and tons of courses on all sorts of useful topics. I do want to be considerate of the fact that not everyone will be able to get a premium subscription, so you can also access the data and notebook for this project on my GitHub page, which is linked in the description, and follow along there.

To get the most out of this video, I encourage you to try the tasks as I present them and build this regression model on your own. If you ever get stuck, or want a step-by-step breakdown, you can watch the video, pause, try a task on your own, and then resume to see how I would approach it. The ultimate goal is to get to the point where you can do projects like this completely from scratch, from a blank slate, but a stepping stone to that is to see individual small tasks presented, work through them one at a time, and ultimately complete the project. I think this is a fun, very valuable project, so I'm excited to get into it. Thank you to DataCamp for sponsoring this video; the link at the top of the description gets you 25% off an annual subscription.

All right, to get started with the project you can click on it, and again, the GitHub link is at the bottom of the description. So we're building a regression model for this insurance data. Ultimately we want to be predicting these charges, and we're given information such as age, sex, BMI, children, smoker, region, etc. My first thought with a regression problem like this is to think about it at a high level: for each of these variables, what do I expect them to do to the costs? Then we can validate whether those expectations are correct.
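The age-versus-charges refresher from earlier in this segment can be sketched in a few lines. The numbers below are purely illustrative (they are not from the real insurance file), and the fit uses plain numpy rather than any particular library from the project:

```python
import numpy as np

# Toy data: age vs. insurance charges (made-up numbers for illustration only)
ages = np.array([19, 25, 33, 41, 52, 60])
charges = np.array([1800.0, 2400.0, 3900.0, 5100.0, 7200.0, 8900.0])

# Fit a straight line: charges ≈ slope * age + intercept.
# This fitted line is the "model" in the linear regression sense.
slope, intercept = np.polyfit(ages, charges, deg=1)

# Predict the expected charge for a new age using that line
predicted = slope * 45 + intercept

# R² = 1 - (residual sum of squares / total sum of squares):
# how much of the variation in charges the line accounts for
fitted = slope * ages + intercept
ss_res = np.sum((charges - fitted) ** 2)
ss_tot = np.sum((charges - charges.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

With more variables (BMI, smoker, region, and so on) the same idea applies, just in a higher-dimensional space.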

Segment 2 (05:00 - 10:00)

So, thinking through the variables: I would expect that as age goes up, the medical charges also go up, so those variables should be positively correlated. I don't know if sex will have any effect on costs. I definitely think there will be a positive correlation between BMI and medical costs: BMI is body mass index, so if you have a very high BMI you're overweight and more at risk for certain health conditions, and I'd expect that as BMI goes up, medical charges also go up. Region is a trickier one. This is the US, so my very high-level expectation is that if you're in the Northeast, around New York City or Boston where I'm at, costs will be higher (I feel like everything's more expensive here), and if you're in the Southeast, maybe they're lower. I have a hunch on these things, and that's a good starting point; we'll test some of them out.

There's also the question of what type of regression to use. If your independent variables each have roughly linear relationships with the target, you can probably use a linear regression model to model them all together. But if some variable has, say, an exponential relationship with the charges, then a polynomial regression or a different type of regression might make more sense. That's another thing to be thinking about.

All of that being said, the first task for this video will be to clean up our data. We're reading it directly from this file, but if you want to read it from GitHub instead, you can go to insurance.csv, click on the raw form of it, take that link, and then, in a different cell, do `import pandas as pd`, `df = pd.read_csv(url)`, and a `df.head()` to access the same data. So if you're not using DataCamp, feel free to do it this way; the GitHub link is in the description. I want to make sure everyone can follow along no matter where you're working from.

So, task number one: clean up the data. This is basically task number one with all things. Feel free to just clean up the data as you see fit, but if you want a hint on what to look into: first, are there any missing values? Second, I want to standardize some of these columns, like the categories in region. Third, the charges data type is a little weird; we see dollar signs in some values and float values in others, so I'd standardize the charges data type, probably make it a float, and make sure there are no strings in there. That's a good starting point, and there might be some other things we'll get into as we move forward. So: can we clean up the data? Feel free to pause the video, try this on your own, and resume when you want to see my solution; I might have some additional cleaning steps beyond these.

All right, cleaning the data. Are there missing values? A good way to start is to make sure the insurance data is loaded in (I'm using Shift+Enter to run the cell and move to the next one). If I do `insurance.info()`, we should see information on null values, and I do see a little bit of weirdness here: we're missing charge values for some of these rows. We want to be cognizant of that. We could fill the NAs with an average or something, but I might just recommend dropping those rows instead.
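The loading-and-inspecting step above can be sketched like this. Since the exact GitHub raw URL is project-specific, this sketch uses a tiny inline sample that mimics the quirks described in the video (mixed region capitalization, stray dollar signs, a missing charge):

```python
import io
import pandas as pd

# In the video the data comes from DataCamp or a GitHub raw URL, e.g.
#   insurance = pd.read_csv("https://raw.githubusercontent.com/.../insurance.csv")
# (that URL is a placeholder). Here we use an inline sample instead.
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,$16884.92
18,male,33.8,1,no,Southeast,1725.55
28,male,33.0,3,no,southeast,
"""
insurance = pd.read_csv(io.StringIO(csv_text))

# .info() prints dtypes and non-null counts; .isna().sum() gives a
# per-column count of missing values, which is what we care about here
missing_per_column = insurance.isna().sum()
```

In the notebook, `insurance.info()` gives the same picture interactively: dtypes plus non-null counts per column.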

Segment 3 (10:00 - 15:00)

So let's drop the rows with missing values and see how many we're left with. I'll do `insurance_filled = insurance.dropna()`, which drops any rows with null values from our data frame (I don't know if "filled" is the right word, but we'll go with it), and then run `.info()` on that. Okay, roughly 1,208 rows, all with non-null values; I'm personally fine with that. You could be more creative and fill in some of those values using `fillna` with the average or something, but I feel like that might skew our results a bit. I don't love filling things in; it depends on which column is empty. So I'm going to go with this `dropna` as a first cleaning step, and I think that will be satisfactory. I'll mark that task done.

Next, standardize the categories in region. I'll take `insurance_filled["region"]` and use the helpful `unique()` function. Everything we get is pretty similar; the only issue is some inconsistent capitalization. It makes the most sense to just lowercase everything, so I'll do `insurance_filled["region"] = insurance_filled["region"].str.lower()`, then call `unique()` on region again, and hopefully we have only four values instead of the eight we currently have. And look at that, it worked, so we can check this off too.

Now I'm going back to `.info()`, because it shows us the data types. We have charges as an object when it should definitely be a float64; we'll come back to that. We might also want to double-check that the sex column has only two categories. I'm guessing it will be female and male (it might also have a null or not-applicable value, but I think this column is sex assigned at birth). So I'll call `unique()` and... oh jeez, it's good that we did this: we have "female", "male", "woman", "man", "F", and "M". There really are just two categories, but we need to standardize them, so I'm adding another task: standardize sex to just two categories.

Let's think about how to do that. We could do individual reassignments, but a mapping is cleaner. Let's just make everything "female" and "male"; that feels easy enough. I'm going to add a dictionary, call it `sex_map`: "F" becomes female and "woman" becomes female, and it's good practice to define the target strings as constants, because that ensures you use the same exact string value everywhere and don't accidentally mistype something. We do the same on the other side: "man" becomes male, and "M" becomes male. That should be good.
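The three cleaning steps so far (drop NAs, lowercase region, build the sex mapping) can be sketched together. The sample values are hypothetical stand-ins for the real file:

```python
import io
import pandas as pd

# Tiny sample mimicking the messiness described in the video (hypothetical values)
csv_text = """age,sex,region,charges
19,F,Southwest,16884.92
18,man,southeast,1725.55
28,woman,Northwest,
45,M,northeast,8240.59
"""
insurance = pd.read_csv(io.StringIO(csv_text))

# 1) Drop rows with any missing values
insurance_filled = insurance.dropna()

# 2) Standardize region categories by lowercasing
insurance_filled["region"] = insurance_filled["region"].str.lower()

# 3) Collapse the sex column down to two canonical categories.
#    Shared string constants avoid accidental typos in the map.
FEMALE, MALE = "female", "male"
sex_map = {"F": FEMALE, "woman": FEMALE, "female": FEMALE,
           "M": MALE, "man": MALE, "male": MALE}
insurance_filled["sex"] = insurance_filled["sex"].replace(sex_map)
```

After this, `insurance_filled["sex"].unique()` and `insurance_filled["region"].unique()` should each show only the canonical categories.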

Segment 4 (15:00 - 20:00)

Then we need to actually apply the mapping, something like `insurance_filled["sex"] = insurance_filled["sex"].map(...)`, though I forget exactly how to do this; it maybe calls for a Google search. Actually, within the DataCamp platform we can use the AI assistant (and if you're using something like VS Code you might have access to GitHub Copilot). I'll ask it to update the sex column's values based on the dictionary. (I'm a little worried the YouTube algorithm will demonetize this just for saying the term, even though that's literally what the column is called, but we'll cross our fingers.) The AI didn't change too much; it left my other stuff as is, and `replace` is the function we wanted. That looks good, so let's run it, and then check the unique values again: we should have just two, female and male. Looks good, perfect.

We kind of have to go through each column like this. Looking at `.info()` again: age is a float, that's good; BMI is a float. Why is children a float? That's strange to me; you can't have 0.5 of a child. I hope everything's fine there, but as long as it's not an object I'm pretty happy.

Next we should check smoker and make sure it's just two values, yes and no, and honestly we should probably change it to a Boolean. So the task is: change the smoker column to Boolean. To do that, we can just do `insurance_filled["smoker"] = insurance_filled["smoker"] == "yes"`, so a "yes" becomes True and everything else becomes False. Let's try it and look at the head. Okay, that looks good; it's now a Boolean value, and `.info()` confirms it's a bool, which will be easier for us to work with. We could potentially change sex to something like "is_female" or "is_male" that's also True or False; we'll see more of that in a bit.

All right, next up: standardize the charges data type. This should be a float, so let's look at that column. There's a handy `sample()` function; we'll use it on charges to get random charge values (I have to specify a count, like 5 or 10), and I'm curious what kinds of things we see. Okay, I see dollar signs. I'll keep re-running this (Ctrl+Enter to run the cell in place) to see if anything else is weird, and it really is just a matter of the occasional dollar sign, so we need to strip that away.
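The smoker conversion above is a one-liner: comparing a string Series to "yes" yields a boolean Series directly. A minimal sketch with a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame; the real column holds "yes"/"no" strings
insurance_filled = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

# Element-wise comparison returns True for "yes" and False otherwise,
# replacing the string column with a proper boolean one
insurance_filled["smoker"] = insurance_filled["smoker"] == "yes"
```

The same pattern (`series == value`) is what later turns sex into an `is_male` indicator.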

Segment 5 (20:00 - 25:00)

To do that, I think we're fine to do `insurance_filled["charges"] = insurance_filled["charges"].str.strip("$")`, and then we want to convert it to a float, so I'll chain `.astype("float64")`, since that's what the other numeric columns are. We're doing two things here: stripping away the dollar signs and then casting to the float64 type. If we just did the cast immediately it would probably give us an error, so we do have to strip first. Let's try that, then look at `insurance_filled.info()` to see whether it successfully became float64. Looks good. And if we run `sample()` we can check that the dollar signs are gone; I don't see any, so I think that's pretty good.

Oh wow, it's good that we did this. I don't know if you're seeing what I'm seeing, but there are negative values in some of these columns. That's crazy, and it would screw up the regression, so we also want to make sure these are all positive values. This is something you'd have to figure out from how the data was collected: do we know that a negative number means someone accidentally wrote a dash and it should just be the positive version of the value, or should we drop that row entirely? I'm going to go with the assumption that it's accidentally negative, and we should just make everything positive.

I'm trying to think of the easiest way to do that, so I'll look up something like "make all numerical columns positive" or "absolute value in pandas data frame" and see what comes up. I don't love the first solution, but I kind of like this one. If we really wanted to keep it simple, we could just do it column by column, and that's a fine solution if you didn't want to use the lambda version, but I'll do the one-liner. I don't like the `df.update` variant; I'd rather do a `df.apply`. Our data frame is `insurance_filled`, and I'll call the result something like `insurance_pos` (for positive): `insurance_pos = insurance_filled.apply(...)` with the absolute-value logic from that answer. We also need to make sure numpy is imported; oh, it's already imported, so we should be able to run this. Then we can do `insurance_pos.sample()` and check whether any negative numbers remain. All right, let's run this a few times.
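The charges cleanup described above (strip the dollar sign, cast to float, flip accidental negatives) can be sketched like this. The values are hypothetical, and note that on the real frame you would only take the absolute value of the numeric columns, not the string ones:

```python
import pandas as pd

# Hypothetical charges column with the quirks seen in the video:
# stray dollar signs and an accidentally negative value
insurance_filled = pd.DataFrame({"charges": ["$16884.92", "1725.55", "-4449.46"]})

# Strip the dollar sign, then cast to float64 (casting first would fail
# on strings like "$16884.92")
insurance_filled["charges"] = (
    insurance_filled["charges"].str.strip("$").astype("float64")
)

# Treat negative values as sign-entry mistakes and take the absolute value.
# (Dropping those rows instead would also be a defensible choice.)
insurance_pos = insurance_filled.abs()
```

A quick `insurance_pos.sample(5)` afterwards is a cheap sanity check that no dollar signs or negatives survive.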

Segment 6 (25:00 - 30:00)

I don't see any negatives there; this is all looking pretty good. I like using `sample()` to confirm, though you could also just load the full data frame and look at things. I'd say this is a sufficiently cleaned data frame. If you wanted to, especially if you're working locally, you could save it as a cleaned insurance.csv (I'd probably set `index=False`, etc.), but I'm pretty happy with this. Let's move on to task two.

Okay, task number two. This is just generally good practice for regression problems: let's do some scatter plots of the relationships between our independent variables and charges. We want to see what relationships like age versus charges or children versus charges look like, because that will help us pick what type of regression to use. Feel free to pause the video, try this task on your own, and resume when you want to see how I would approach it. This isn't strictly necessary, but it's good exploratory data analysis: you see the relationships and get a feel for things before deciding what type of regression to use.

I also quickly want to call out that if you're using the DataCamp platform for these projects, one section I like, especially if you want to do more projects like this in the future, is the guides, where you get recommendations on the steps to take. If you break the project down into those bite-sized chunks (start with data exploration and cleaning, then model development and training, then making predictions), the step-by-step process they outline makes the project a lot more attainable. As you get better, you'll be able to figure out those steps on your own and build your own intuition, but it's worth mentioning.

Okay, task number two: scatter plots of relationships between variables and charges. For this I'd use matplotlib, `import matplotlib.pyplot as plt`, which is what we'll typically use here. Then we'll want to do things like `plt.scatter(...)`. We have our `insurance_pos`; I might do `df = insurance_pos.copy()` just so I can work with `df` instead of the longer name. Now I can do `plt.scatter(df["age"], df["charges"])`, with charges always on the y-axis. Run this and... oh wow, interesting. We definitely see a slightly positive correlation with age, though it is very scattered.

Let's look at our other columns. Sex is going to be weird because it's just male or female, and indeed this doesn't tell us much of anything; there seem to be similar amounts of dots in both spots. I'm curious to do region real quick to see if there are any big differences. No; it's hard to get a real frequency count from a scatter plot (there might be ways to modify this to see frequencies more clearly), but it seems like region might not affect things too much. I think BMI definitely should... oh, interesting. There's definitely a positive correlation here, but there's a lot of interesting behavior down low where the relationship is flatter. Smoker, let's check out that column. Yeah, look at this, it's very clear.
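The plotting loop described above can be sketched as follows, with a hypothetical mini-frame standing in for the cleaned data and a non-interactive backend so it runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; plots render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned frame; in the video this is df = insurance_pos.copy()
df = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "charges": [1800.0, 2400.0, 3900.0, 5100.0, 7200.0, 8900.0],
})

# One scatter plot of an independent variable against charges;
# in practice you would repeat this for bmi, smoker, region, etc.
fig, ax = plt.subplots()
ax.scatter(df["age"], df["charges"])
ax.set_xlabel("age")
ax.set_ylabel("charges")
```

In a notebook you would end the cell with `plt.show()`; locally, `fig.savefig("age_vs_charges.png")` works too.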

Segment 7 (30:00 - 35:00)

Remember, smoker is a Boolean now: one means true, they are a smoker, and zero means false, they aren't. It definitely has a strong positive relationship with the charges. Cool. I think that's good for task number two; there's no real right or wrong answer here. From what I'm seeing, there are no clearly exponential or otherwise weird relationships, so I'm pretty comfortable using a linear regression model. Just to illustrate, I created a cell and told the AI to create an exponential scatter plot example (I love how easy AI makes it to demonstrate stuff): if one of our variables looked like that against charges, a linear regression would probably not be the way to incorporate it. But we didn't see anything like that in our data, so I'm totally fine using linear regression. Another thing I tend to do is ask an AI something like "when do I use which types of regression?"; that helps you be on the lookout for the right kinds of patterns.

On to task number three (I guess you can't try this one ahead of time, since you don't know the task yet). Task number three is to prepare the data for model fitting, and I'm curious what the guides say here. The biggest thing we typically need to do is this: when we're fitting a machine learning model, a regression model, we like numbers. We don't like strings like "male"/"female" or "southwest"/"southeast". So we want to convert all of our columns to numerical values.

Region is interesting because there are multiple categories; it's not a simple Boolean. You might think: I could just make southwest zero, southeast one, northeast two, and northwest three. You don't want to do that, because it implies some sort of linear, monotonically increasing relationship between the categories, and the model might fit to that. Instead, what you tend to do is break it up into what's called a one-hot encoding. That might look like columns such as is_southeast, is_southwest, is_northeast, and so on, with ones and zeros in each spot (a row obviously can't be both southeast and southwest); the one-hot encoding has one column per category.

There's also an encoding called dummy encoding, which observes that if we have four categories, we really only need three columns: if those three are all zero, then by process of elimination the row must be the fourth category, northwest. So I think we should dummy encode region, and then just do a binary encoding for the other columns that aren't already numbers. Let's go ahead and do that. Feel free to pause the video, try this on your own, and resume when you want to see my solution.

Okay, there are multiple ways to do this. If you don't remember how to do the dummy encoding, you could do something like `df["region"] == "southwest"` for each region, and just do it for three of the four. You could also look up "dummy encoding in Python pandas" and find another way to do it.
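The manual version hinted at above, building the one-hot columns by hand and then dropping one to get a dummy encoding, can be sketched like this (column names such as `is_southwest` are illustrative, not from the project):

```python
import pandas as pd

df = pd.DataFrame({"region": ["southwest", "southeast", "northeast", "northwest"]})

# Manual one-hot: one 0/1 indicator column per category
for region in ["southwest", "southeast", "northeast", "northwest"]:
    df[f"is_{region}"] = (df["region"] == region).astype(int)

# Dummy encoding keeps k-1 of the k indicator columns: if the remaining
# three are all 0, the row must be the dropped category ("northwest" here)
dummy_df = df.drop(columns=["region", "is_northwest"])
```

This avoids the fake ordering that an integer label (0, 1, 2, 3) would impose on the regions.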

Segment 8 (35:00 - 40:00)

Let's see how they do it. There's `pd.get_dummies`; can we do this on a single pandas column? We want a prefix. Okay, this makes sense. So if you didn't want to write things out manually, you could do `df_new = pd.get_dummies(df, prefix="region", columns=["region"])`. I wasn't sure whether those arguments have to be lists; it does seem to want `columns` as a list. Let's see what this gives us (I'm calling it `df_new` so we don't replace our original). Cool, look at that: region_northeast, region_northwest, and so on.

The tricky part is that we want to drop one of these. Let's count values with `df["region"].value_counts()`: they're all pretty similar. I think the standard practice is to drop the dummy column for the category that appears most frequently; since the counts are so similar here it shouldn't matter much, but we'll do `df_new = df_new.drop(columns=["region_southeast"])`. That still gives our model enough information; we just dropped the extra column we don't actually need, because if the other three are all zeros, we know by default the row is southeast. Looking at the result, it looks good. Cool.

Now let's convert the other columns to zeros and ones. We can do `df_new["smoker"] = df_new["smoker"].astype("int64")`. Okay, that's now ones and zeros, that's good. We also want to do it for sex; I'll do something like `df_new["is_male"] = (df_new["sex"] == "male").astype("int64")`, since a new column named is_male is clearer. By the way, int64 is not the most compact choice, but this data is small enough that it doesn't really matter; realistically this could be a much smaller integer type. The 64 in int64 is the number of bits used per value, which is eight bytes.
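The encoding steps above (get_dummies on region, drop the most frequent category's column, binarize smoker and sex) can be sketched end-to-end with a hypothetical mini-frame:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["southwest", "southeast", "northeast", "southeast"],
    "smoker": [True, False, False, True],
    "sex": ["male", "female", "male", "female"],
})

# pd.get_dummies expands the named columns into indicator columns
df_new = pd.get_dummies(df, columns=["region"], prefix="region")

# Drop the most frequent category's column to get a dummy (k-1) encoding;
# here that works out to "southeast"
most_frequent = df["region"].value_counts().idxmax()
df_new = df_new.drop(columns=[f"region_{most_frequent}"])

# Binary columns become plain 0/1 integers
df_new["smoker"] = df_new["smoker"].astype("int64")
df_new["is_male"] = (df_new["sex"] == "male").astype("int64")
df_new = df_new.drop(columns=["sex"])
```

After this, every column is numeric, which is what the model fitting step needs.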

Segment 9 (40:00 - 45:00)

If you're only encoding a one or a zero, you really only need a single bit, so there are more space-efficient ways to do this, but it's not a big deal. You always have to ask yourself whether efficiency actually matters in your specific use case, and our data isn't big enough for it to make a difference, so I'm not going to worry about being maximally efficient with the data types. I'll go ahead and do `df_new = df_new.drop(columns=["sex"])`, and now let's look at `df_new`. Okay: ones, zeros, is_male. This is all numerical values, so now I think we can go ahead and fit a model. The only other thing I might do is turn a lot of this cleanup into a function, so that when we need to do it for our validation data it's ready to go; we'll get to that in a second.

That brings us to task number four: fit a linear regression model to our data. Our new data frame is `df_new`, and there are multiple ways to do this. You could Google how to do a linear regression (I don't remember the syntax off the top of my head), or you could lean on the AI. I'll type into the AI: "make a linear regression model using all columns except for charges as input, and try to predict the charges column as output; our data frame is called df_new" and see what it does. Cool, this looks good, except we don't strictly need the train/test split it added, but we can see what happens with it. Also keep in mind you don't have to assume linear regression is the best option; there could be a better one, so one method you may use is trying multiple models and seeing what gives you the best results. Maybe we'll see that in a bit.

Let's try this. What does this give us? "Input y contains NaN": what are you talking about? The dtype is float64, that looks right... oh wait, why is there one null entry? That's so weird; I'm curious what happened. Let's print `df_new.info()`: it shows about 1,208 entries, but one column has fewer non-null values.
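The fit the AI assistant generated is roughly along these lines; this is a minimal sketch with synthetic data standing in for `df_new` (the coefficients and noise below are invented purely so the example runs standalone):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, all-numeric df_new
rng = np.random.default_rng(0)
n = 200
df_new = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
# Invented ground-truth relationship plus noise, for illustration only
df_new["charges"] = (250 * df_new["age"] + 300 * df_new["bmi"]
                     + 20000 * df_new["smoker"] + rng.normal(0, 2000, n))

# Everything except charges is input; charges is the target
X = df_new.drop(columns=["charges"])
y = df_new["charges"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # R² on held-out data
```

On the real data you would swap in the actual `df_new` and the split is optional but good practice.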

Segment 10 (45:00 - 50:00)

nuls and oh what the heck how did was this always 1207 when do we lose uh an NA or something that's so weird what I'm going to do is I'm going to just do a DF new equals DF new uh drop na columns equals charges can I do this okay I want just Google this drop na for certain for a specific column I mean actually if we just do it a another drop in a it will only get it down one column okay there's other ways to do this but I'm fine honestly just doing a another one of these and now if we run this again we should see 1207 for all these okay that's fine with me uh okay mean squared error this seems like a really high error I want to do R squ error not mean squared because that's what we're ultimately trying to do r s python R2 score okay let's do that we're going to do R2 score and now equals R2 score and then we want to Output R2 so let's see what our R2 score is here based on this linear aggression model 70 I like that a lot um I'm pretty happy with that it means that from these variables you know we can account for 70% of the variation I wonder what happens like one thing that's interesting sometimes to play around with is you know play around with like not using let's say the is mail column run that again about the same it didn't really make any difference if we did drop something like the um age column I bet you that it would be more problematic see that drops it significantly if we did drop the smoker column I bet you it will make our model much worse wow the smoker column really tells us a lot huh that is crazy we have like an impossible time predicting the charges if we drop these columns that's crazy um so the smoker one seems like the strongest variable um we'll keep that we just don't want the charges we'll keep all of our values another thing you might do is you might try different types of regression so we might try like even though I thought that um linear regression made the most sense we can try other types of oh of models from sklearn Dot let's see 
"regression models scikit-learn"... okay, linear models. You could do all sorts of crazy stuff with these. Maybe we do a polynomial regression... is there a way to easily

Segment 11 (50:00 - 55:00)

import this model? Okay, so this one is slightly different, it makes a linear space... I need to review my models. But we could do kernel ridge, probably, so let's see if this works, and then we could pass in... okay, try KernelRidge. Let's see if this works. Okay, so we see that we actually get worse performance; the linear model is better. This does like a 67 R squared. And if you're not super familiar with the R squared concept and what it means: a higher value is better. If we got an R squared of 1, it would mean 100% of our variability could be described by the columns we're using in our fitting function. When we see a value like 67.9 here, that means 67.9% of our variation can be described, or kind of accounted for, by the variables we're using. Okay, so that's pretty good. And you could keep trying other types of regressions, but I think we're pretty happy with the linear regression, so I'm fine with this. Oh, what happened? Oh, I just need to go back to linear regression, get the 0.70. And now we want to validate; we want to try this on the validation dataset. Okay: store the predictions as a new column in the validation dataset called predicted charges. So val_df equals pd.read_csv... there is another file, if we show files, called validation dataset, and again you could easily load this from GitHub in a similar way, by going to the validation dataset file, clicking on Raw, copying that URL, and pasting it into your pandas read_csv call.
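The model comparison above might be sketched like this, with kernel ridge against plain linear regression on randomly generated linear-ish data. The data is synthetic, so these scores stand in for, and will not match, the video's numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in: "charges" roughly linear in "age" plus noise.
X = rng.uniform(18, 65, size=(200, 1))
y = 250 * X[:, 0] + rng.normal(0, 2000, size=200)

lin_r2 = r2_score(y, LinearRegression().fit(X, y).predict(X))
# KernelRidge defaults to a linear kernel; try kernel="rbf" etc. to experiment.
kr_r2 = r2_score(y, KernelRidge(alpha=1.0).fit(X, y).predict(X))
print(round(lin_r2, 3), round(kr_r2, 3))
```

On training data, ordinary least squares maximizes R² among linear fits, so the regularized kernel ridge score here can only tie or come in lower, matching what happens in the video.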
val_df.head()... let's just look at this. Oh, what is the file called? validation_dataset, okay, there we go. I would say we need to do the same data preprocessing, so what we might do is take all of this stuff we did before. We'll call this task number five: see how the model performs on the validation dataset. Feel free to pause the video and try this on your own. As a kind of subtask: create a helper function to preprocess the data frame, def... All right, pause the video, feel free to try it out on your own, and then resume. I'm going to create a function called preprocess_df that'll take in a df and ultimately return a df, and basically all we want to do is the same steps that we did before. My hope is that there isn't any messy data in the validation dataset. I think we could do a little bit of sampling to check what we might see and what we might not. So there's no charges column here, but do we see any negative numbers or anything? No, it looks like it's overall
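A helper like the one described might look as follows. The column names (`sex`, `smoker`, `region`) and the exact encodings are assumptions based on the typical insurance dataset, not the video's exact code:

```python
import pandas as pd

def preprocess_df(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing mirroring the training steps:
    binary-encode sex and smoker, one-hot encode region."""
    out = df.copy()
    out["is_male"] = (out["sex"] == "male").astype(int)
    out["smoker"] = (out["smoker"] == "yes").astype(int)
    out = pd.get_dummies(out, columns=["region"], dtype=int)
    return out.drop(columns=["sex"])

# Invented two-row stand-in for the validation frame.
val_df = pd.DataFrame({
    "age": [31, 46],
    "sex": ["female", "male"],
    "bmi": [25.7, 33.4],
    "children": [0, 2],
    "smoker": ["no", "yes"],
    "region": ["southwest", "northeast"],
})
processed = preprocess_df(val_df)
print(processed.columns.tolist())
```

Putting the conversions in one function means the training and validation frames are guaranteed to get identical treatment, which is the whole point of the refactor in the video.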

Segment 12 (55:00 - 60:00)

cleaned; we just need to really do the last steps that we did. Holy crap, there's 13 children? That's crazy, that's funny. Okay, all right, so I really just need to do the dummy stuff and the other binary variable conversions. So where is that done... this is all part of it; I think that this is mainly it. Is there any other stuff that we need to do? I don't think so. I don't think we need to do these steps; I don't see any of the other weirdness that we had to deal with before. So we'll just take this piece and paste it into our function. Honestly it's already using df here, and I think as long as we return df_new here we're good. Input df equals preprocess... I'm going to run this with Shift+Enter: preprocess_df, pass in the validation df, saving as a pandas data frame called validation_data. "Call this data if needed. Handle any negative values by replacing them with the minimum basic charge." There's negatives? Oh man. Okay, I'm going to not worry about this last part for now, and then ultimately we want to do model.predict(validation_data). And honestly I could just call this input_df, I'll show you why... input_df. Let's see our predictions real quick, just to make sure it works properly. What's going on? Oh, was there a stale step? Oh, we already made the boolean smoker column up here, okay, I've got to copy this into the function too. So this should work now... run that, run this, cool, look at that, those are the predictions. And it says fill negative values with the basic cost, which is set at a thousand. How could I replace that? I could just do predictions_new equals... x for x in predictions if x greater than zero else 1000. That would... I thought this was the right syntax. "x for x in predictions if else list comprehension python"... f(x) if x is not None else... okay, so I need to do the else first: x
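The syntax stumble at this point is a common one: in a list comprehension, `A if cond else B` is a conditional expression whose value-if-true must come before the `if`, while a trailing `if cond` with no `else` filters items out instead of replacing them. A minimal sketch with made-up prediction values:

```python
predictions = [-320.5, 850.0, 4200.75, 12500.0]  # invented raw model outputs

# Replacement: conditional expression, else required, written before `for`.
replaced = [x if x > 0 else 1000 for x in predictions]

# Filter: a trailing `if` with no else drops items entirely.
filtered = [x for x in predictions if x > 0]

print(replaced)  # -> [1000, 850.0, 4200.75, 12500.0]
print(filtered)  # -> [850.0, 4200.75, 12500.0]
```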

Segment 13 (60:00 - 65:00)

else 1000 for x in predictions. Let's see if this works... that seems to have worked. And now finally our validation data is going to get a predicted charges column. Okay, first we need to do validation_data equals val_df.copy(), and our validation data's predicted charges column is going to be equal to predictions_new. So that adds the column, and then if we look at validation_data.head() we do see we get our predicted charges in all of those. And the minimum basic charge is a thousand, so I probably should replace anything that's under a thousand with a thousand. But I think we can run all here first, hopefully there are no errors, and then submit the project... where is it... "your solution does not look..." uh, I'm going to just remove this line real quick. Where is this coming from? Can it tell me where this issue is? "'>' is not supported..." And one other thing I might do too: our model should really be trained on all of this data; we don't need to leave any out. So I'm just going to do test_size equals zero, or as small as possible, because we want X_train to get as much training data as possible, to make the best model for the validation data, if that makes sense. I'm wondering what's going on... run all... "function and float"... I wish it told me a little bit more. Okay, well, I'll do this... oh, I mean this will actually just solve it, because if we floor anything under a thousand, that handles the negative case too, so I can just get rid of this. I'm so confused by "'>' is not supported between instances of 'function' and 'float'". What is that trying to tell me? Like, validation_data... am I just supposed to have only the one column? I just don't know how I'm supposed to save this. Do I need to save it back as validation_dataset.csv?
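The column-adding and flooring steps might be sketched like this, with an invented stand-in frame. Flooring at the 1000 minimum charge also covers the negative predictions in one pass:

```python
import pandas as pd

val_df = pd.DataFrame({"age": [31, 46, 22]})   # stand-in validation frame
predictions = [-120.0, 860.4, 15000.0]         # invented raw predictions

# Floor at the minimum basic charge; this also fixes negative predictions.
predictions_new = [x if x > 1000 else 1000 for x in predictions]

# Copy first so the original frame is left untouched.
validation_data = val_df.copy()
validation_data["predicted charges"] = predictions_new
print(validation_data)
```

Copying before assignment avoids pandas' `SettingWithCopyWarning` and keeps the raw validation frame reusable if you want to rerun the pipeline.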
Um, I'm pretty sure we're good; I'm happy with this. I might just look at the solution quick, which is available here, just to see how they want me to store this so I can submit it properly. Because it seems like we have predictions for everything here, predicted charges... so I don't know what's going on. "function and float"... I don't know where this error is popping up.

Segment 14 (65:00 - 67:00)

Uh, maybe it's because it wants me to save this as r2_score. The thing that's annoying is that I'm importing this r2_score function, so now it's all weird, and this is probably where the error is occurring. But that's just a weird little nuance. Oh man, we'll see if this works... it works, okay, cool. That was a silly little thing, but glad we figured it out. Reading the instructions helps. That was fun, five stars. Cool. Well, again, the link at the top of the description will give you 25% off an annual subscription, so you can click that, and I forget exactly what page it brings you to, but there are tons and tons more projects to try out. I think that if you try to do them more and more with less help, it will make a big difference. These types of projects are very applicable; they're similar to what you'll need to do in the real world, so I think they're a great way to advance your skills, especially when there are additional things beyond Python. Like this customer analytics one, which I think is SQL and a bit of Python. Let's see... or actually, go back, there are some that are both SQL and Python, like maybe this one, yeah, the retail data pipeline. If we look at this project, we see that there are actually some PostgreSQL commands we need to run, in addition to dealing with Parquet files, which is a very important concept to understand. So a lot of great projects here. Link at the top of the description to get 25% off. Thank you to DataCamp for sponsoring this video. Hopefully you enjoyed this project, and hopefully it was a fun real-world regression example. If you have any questions, let me know in the comments; I'll do my best to respond to as many as possible. I'm trying to think if there's anything else... if you liked this video, throw a thumbs up and subscribe to the channel if you haven't already; more videos are coming soon. Let me know if there are any topics you want me to cover. Until next time, everyone, peace out!
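One plausible reading of that cryptic "'>' not supported between instances of 'function' and 'float'" error, hinted at by the fix in the video, is name shadowing: if the grader expects a variable named r2_score but the notebook only has the imported function bound to that name (or the assignment overwrites the function), comparisons and later calls break. A sketch of the pitfall, with invented toy values:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.8, 5.1, 7.2]

# Rebinding the imported name shadows the function...
r2_score = r2_score(y_true, y_pred)

# ...so a second call like r2_score(y_true, y_pred) would now raise
# "TypeError: 'float' object is not callable". If the assignment had
# never run, a grader comparing r2_score > threshold would instead be
# comparing the bare function object to a number, producing the
# "'>' not supported between instances of 'function' and 'float'" error.
print(isinstance(r2_score, float))  # -> True
```

A distinct variable name, e.g. `score = r2_score(y_true, y_pred)`, avoids both failure modes.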
