# Machine Learning with Big Data using Spark and Tidymodels

## Metadata

- **Channel:** Andrew Couch
- **YouTube:** https://www.youtube.com/watch?v=gfvNln7uAuQ
- **Source:** https://ekstraktznaniy.ru/video/44692

## Transcript

### Segment 1 (00:00 - 05:00)

Hey all, it's Andrew Couch here, and in today's video we're going to go over how to create machine learning models using big data. But first, a huge announcement: this will be my last Tidy Tuesday video for the series, and the end of my regularly scheduled content. I'll talk more about the details toward the end of the video. First, let's open up an R Markdown document; I'm going to call this "Tidy Tuesday Big Data". We're going to load a few packages: the tidyverse, tidymodels, tidypredict (for making predictions in-database), finetune (for training some models), sparklyr (for creating models using Spark as a back end), and also the DBI package. We'll also set our options; I always do `options(tidymodels.dark = TRUE)`. Lastly, we can load in our data set, which is a customer churn data set that we've used in some of my other videos.

If we look at our customer churn data set, we have a customer ID, some general attributes of the person, monthly and total charges, and whether they actually churned or not. If we look at the missing values, say with `is.na` and a `colSums`, we can see that there are 11 missing values. Considering that we have about 7,000 rows, I think it's fair to say we can just drop the samples with the missing values. So let's do some data cleaning: I'm going to `mutate` across all the columns where the variable is a character and convert them with `as.factor`, so now everything that was a character is a factor, and then lastly we do a `drop_na` to drop those missing values. So the comment is "convert features to factors and drop missing values". I could do the factor conversion for senior citizen too, but senior citizen is already a one-hot-encoded (dummy) variable, so we don't have to do much data manipulation to it.

When we're thinking about big data, we're thinking about large volume and large velocity, a lot of samples. One of the main problems we face with big data when making predictive models is that the data often can't fit into memory, the amount of RAM we have. We can always set up different environments using cloud computing and the like, but often we might want to think about how we can do it in the database, since databases have a lot of power behind them, and ideally it would be nice to have the work run in the database instead of on the nice little laptops we have at the office. There are a lot of ways we can do this. I don't use a lot of big data programs myself; the common ones are Hadoop (with Hive) and Spark. I really don't face a lot of big data challenges, but I often want to make predictions in-database, because if we have a model that's pretty good and we can express it as, essentially, a query, then the database can do all that work: we don't have to worry about dependencies with R, we don't have to create Docker containers or anything like that. We just create the query, give it to someone, have them schedule it out, and we can go on from there.

So this is the first method we're going to do, which isn't really about big data: it's about how to make predictions in-database using machine learning models built with the tidymodels framework. We'll do a classic train and test split: `initial_split` on the data frame, deselecting the customer ID, with `strata` based on churn. Then we create our training data with `training(model_splits)`, and our k-folds data will use the training data with
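The setup described above might look like the following sketch. The file path and the column names (`customerID`, `Churn`, following the standard Telco churn data set) are assumptions, not shown in the video:

```r
library(tidyverse)
library(tidymodels)
library(tidypredict)
library(finetune)
library(sparklyr)
library(DBI)

options(tidymodels.dark = TRUE)

# Assumed: the Telco customer churn CSV lives at this hypothetical path
df <- read_csv("telco_churn.csv") %>%
  mutate(across(where(is.character), as.factor)) %>%  # characters -> factors
  drop_na()                                           # drop the 11 incomplete rows

set.seed(42)
model_splits <- initial_split(df %>% select(-customerID), strata = Churn)
train_data   <- training(model_splits)
k_folds      <- vfold_cv(train_data)
```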

### Segment 2 (05:00 - 10:00)

`vfold_cv(train_data)`. Okay, now we're going to do a typical run. When I know I want to make a model that's going to live in a database, I'm thinking about what models are nice and strong but also not too large. Often we reach for tree-based models, like random forests or XGBoost, but the problem with tree-based models is that with a random forest we're creating a lot of trees, and each tree essentially becomes part of the query, so the query can get very large. If we have a random forest trained to predict customer churn, that model might have 500 trees, which can raise issues: the query can be too large and you can run into weird system errors. That being said, I like MARS models when I'm thinking about a SQL model. So: `num_terms = tune()`, `prod_degree = tune()` (the interaction term), set the mode to classification, and set the engine to "earth". Cool. Then we need a small recipe: the recipe on the training data, then `step_dummy(all_nominal_predictors())`. If we `prep` and `juice` it, we can see it's created our dummied variables, and we still have our continuous variables: tenure, monthly charges, and total charges.

Now let's actually tune our model. Notice I loaded the finetune package, which gives us different tuning methods that will select our parameters for us. So: `mars_res <- tune_sim_anneal(...)` with our MARS model, our recipe, our k-folds data, and for metrics I'll say we want to evaluate each iteration by ROC AUC. Now it'll be training, and we should see some little dialog output. While we wait, I'll pre-write some code: an `autoplot` of `mars_res`, a `show_best(mars_res)` to see the ROC AUC, and then we'll create our final model.

Often I think we hear a lot about big data, but from a basic data-science-workflow perspective, I don't think you'll actually see a lot of big data unless your job's field has that type of data. If you're doing a lot of web or e-commerce work, streaming, tech or video work, or sensor data, you'll experience a lot of big data, but your typical, generic machine learning problems generally won't revolve around it. And one of the problems with big data is not just the machine learning part, but also how we're going to store it, how we're going to query it, who's going to host it, what the cost of storing it is, and what actually useful insights we can get from it. When I'm looking at machine learning problems, or thinking about using models for a project, I'm often not looking at the number of samples in a data set but mostly at the quality of the data in general. That's something we should really think about: three or four million samples might not be as good as a 5,000-sample data set with great quality control on it, where we have
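The model, recipe, and tuning steps described above could be sketched like this (assuming the `train_data` and `k_folds` objects from earlier and the `earth` package installed; the metric choice is the ROC AUC mentioned in the video):

```r
# MARS model with both hyperparameters marked for tuning
mars_model <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("earth")

# Minimal recipe: dummy-encode every nominal predictor
rec <- recipe(Churn ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors())

# Simulated-annealing search from the finetune package
mars_res <- tune_sim_anneal(
  mars_model,
  rec,
  resamples = k_folds,
  metrics   = metric_set(roc_auc)
)

autoplot(mars_res)
show_best(mars_res, metric = "roc_auc")
```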

### Segment 3 (10:00 - 15:00)

clean inputs, the data is sanitary, and people are filling in all the inputs from a reasonable standpoint. That's one thing we need to consider when doing anything with big data, or even before: understanding the source of the data before we start running an analysis.

So we can plot out our model metrics. We can see that a degree of interaction of one does slightly better (it's almost like an intercept-only adjustment) in ROC AUC performance, and that performance increases with the number of terms up to where we can be comfortable using five. `show_best` confirms five terms with a product degree of one. One of the reasons I'm not creating a workflow here is that we want to do the pre-processing separately, because we're not going to be able to run our recipe pre-processing inside the database. So let's fit the model to the juiced data, `rec %>% prep() %>% juice()`, and also use the vip package to look at variable importance: tenure, internet service (fiber optic), and payment method (electronic check) are the important variables we see right now. Lastly, we can build a workflow just to evaluate on the test set, the data we haven't seen: `add_model` (I'll copy in the best parameters right here), `add_recipe(rec)`, then `last_fit` on `model_splits` and `collect_metrics()`. Our ROC AUC is 0.80; I think our resampled one was 0.83, so we lose just a little by fixing the number of terms, but that's fine. This is just an example; we're not really going over how to tune a model, since I've made plenty of videos on that.

Now, say we have this model and it's good. We know it's a pretty easy model to build out because it uses five terms, there isn't a huge amount of interaction, and we don't have many features. So we can convert it into a SQL query model very quickly. With our fitted model, we can pipe it into `tidypredict_sql`, and we can use dbplyr (part of dplyr's database tooling) to simulate whatever back-end database we want; I'll do MySQL, which I think is the most bare-bones SQL you can use. Since MARS models are essentially spline models, they're pretty easy to write out or convert to a SQL script: you can see right there it's just a bunch of CASE statements.

Let's see that in practice by creating a little database table and using our query to test the SQL model. We'll say `dbConnect` from the DBI package, using RSQLite with `SQLite()`, and the dbname will be ":memory:": we're creating a table in memory that's going to act as our database. Then we take our data frame (the one with no missing values, before the train/test split) and do a `copy_to`, giving it our connection and our data. Now our data frame is a table in memory, which
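A sketch of the fit, the SQL rendering, and the in-memory database (whether `tidypredict_sql` accepts the parsnip wrapper directly or needs the underlying `earth` fit may depend on package versions; the `rec` and `df` objects are assumed from earlier):

```r
# Final MARS model at the chosen parameters, fit on pre-processed data
churn_data <- rec %>% prep() %>% juice()
fitted_mars <- mars(num_terms = 5, prod_degree = 1) %>%
  set_mode("classification") %>%
  set_engine("earth") %>%
  fit(Churn ~ ., data = churn_data)

# Render the model as SQL (a pile of CASE WHEN statements)
fitted_mars %>% tidypredict_sql(dbplyr::simulate_mysql())

# In-memory SQLite database standing in for a real warehouse
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, df)
```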

### Segment 4 (15:00 - 20:00)

So if we look at this query, there are a few things to notice: it references something like `PaymentMethod_Electronic.check`, which is a dummy variable, because the variables we're using are tenure, internet service, and payment method. So let's select internet service, payment method, and tenure; we'll also select the customer ID (that's our ID), and also churn. We can see we need to do some manipulation, and luckily that's pretty easy: it's just a bunch of CASE statements, so this is where we brush up on our SQL. For internet service, the model wants fiber optic, so we do a CASE statement: `CASE WHEN InternetService = 'Fiber optic' THEN 1 ELSE 0 END AS` whatever the dummy's name is; looking it up, it's the fiber optic column with a little dot that replaces the space. Then the same for payment method: `CASE WHEN PaymentMethod = 'Electronic check' THEN 1 ELSE 0 END AS` the payment-method-electronic-check column. Finally we wrap it in a subquery: `SELECT ... FROM ( ... ) a`, where we copy and paste the model's CASE expression as our churn probability, and we also select the customer ID. So now we have the probability of churn alongside the customer ID from our SQL model, and I'll name the chunk output with `output.var = "sql_pred"`.

Now let's create our final model on the R side. We'll fit it, then augment it against the whole data set: use our recipe to process the data, bind the columns back, select the customer ID and the predicted probability (`.pred`), call that the `mars_model` prediction, left-join `sql_pred` by customer ID, and rename `sql_model = churn_prob`. Just a basic amount of data manipulation, but I want to highlight an example of how sometimes the SQL model will be slightly off. With our mars_model and sql_model columns, we'll do a `geom_point` to plot the direct comparisons, a `geom_abline` for the little y = x line, and `coord_obs_pred`. We'll see that the SQL model is not meeting that line perfectly; in fact, the slope is pretty far off, and that's actually a huge problem with
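The R-side comparison described above could look like this sketch. The names `fitted_mars`, `rec`, and `sql_pred` are assumed from the earlier steps, and `.pred_Yes` is an assumed name for the positive-class probability column:

```r
# Join the R model's predictions against the in-database predictions
preds <- fitted_mars %>%
  augment(rec %>% prep() %>% bake(new_data = df)) %>%
  bind_cols(df %>% select(customerID)) %>%
  select(customerID, mars_model = .pred_Yes) %>%
  left_join(sql_pred, by = "customerID") %>%
  rename(sql_model = churn_prob)

# If the SQL translation were exact, points would sit on the y = x line
preds %>%
  ggplot(aes(mars_model, sql_model)) +
  geom_point() +
  geom_abline() +
  coord_obs_pred()
```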

### Segment 5 (20:00 - 25:00)

converting these machine learning models into a SQL script: we only have CASE statements and the like, we can't get super fancy, and sometimes it doesn't translate perfectly, especially with classification models. What we can do is make a minor adjustment. Also, I think changing the back end can help: in this case we're simulating MySQL, but if you change it to, say, MS SQL, that may actually fix it, so it depends on your back end, and you might not have to change anything. Either way, it's good to make these plots to compare what your SQL model is doing against your true R model. But this is a very easy fix: looking at the plot, we can just change the slope, which is just a minor adjustment, just a linear regression. So let's predict the MARS model probability using the SQL model, with `data = .`, then augment it out and look at the actual adjustments: x is the mars_model, y is `.fitted`, then `geom_point() + geom_abline() + coord_obs_pred()`, and we can see it's been adjusted pretty well. It's not perfectly one-to-one, but it's good enough. You might have to do some more probability calibration and such, but in general this is a pretty quick model that we can nudge with a minor adjustment and do pretty well.

So now I'm going to get the coefficients from that regression, and let's go into what we'd do to adjust our SQL model query; I'm going to show off some SQL I know from school. We'll do a WITH statement: `WITH data AS (SELECT ...)`, then `model_data AS (SELECT ... FROM data)`. So I'm basically creating a subquery right here, then creating another table that queries from that query, and then I'll do a sanity check: `SELECT * FROM model_data`. Finally we can do `adjusted_pred AS (...)`, where from model_data we select the churn probability and do the little adjustment using the linear regression: the intercept plus the coefficient times `churn_prob`, aliased `AS adjusted_prob`, plus the customer ID; then `SELECT * FROM adjusted_pred`. So now we have our adjusted probability right there, and I'll call this chunk `output.var = "sql_adjust"`, because why not. Now that we have `sql_adjust`, we do a little sanity check: grab the earlier predictions, left-join by customer ID, and pivot longer, but first do some renaming just to make it cleaner, so `rename`
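The linear recalibration described above, as a sketch (assuming the comparison data frame with `mars_model` and `sql_model` columns is stored as `preds`, as in the earlier comparison step):

```r
# Simple linear recalibration: predict the R model's probability
# from the SQL model's probability
adjust_fit <- lm(mars_model ~ sql_model, data = preds)

# These two numbers get pasted into the SQL query as
# intercept + coefficient * churn_prob
coef(adjust_fit)

# Visual check of the adjusted fit against the R model's predictions
adjust_fit %>%
  broom::augment() %>%
  ggplot(aes(mars_model, .fitted)) +
  geom_point() +
  geom_abline() +
  coord_obs_pred()
```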

### Segment 6 (25:00 - 30:00)

on the old one I'll say `non_adjusted_prob = churn_prob`, then we'll pivot that to make it tidy alongside `adjusted_prob`, and finally we'll plot it out with a quick ggplot and `geom_point`. After sorting out which column goes on which aesthetic (the mars_model on x), you can see how we went from the original to the adjusted version: we just spread it out with a minor adjustment. Again, nothing too crazy, a pretty fast adjustment to make the model's probabilities a bit better situated. Personally, I would probably just run a logistic regression and adjust through that instead of a linear regression, but it depends on your data. So that right there is a pretty quick way of creating a quick model, putting it into a query, evaluating the query to see how the SQL model compares to the original, and then putting it into a database or giving it to some SQL developer and scheduling it out.

Okay, so that wasn't really big data. I'm sure most of you wanted the big data stuff: how can we actually fit a model in-database using the cool, flashy things like Hadoop and Spark, so we don't have to worry about any of this? Luckily, there's the sparklyr website. What we're going to do now is create a simulated Spark cluster: `sc <- spark_connect(master = "local")`. If you don't have Spark installed, you may need to run the install step first (I just copied and pasted it from the site), which will install a local instance of Spark; again, pretty straightforward. Then we create our table: `churn_table <- copy_to(sc, df)`, and if we look at churn_table, it's backed by a Spark connection and we can see our little churn data set right there.

Now let's do some data manipulation, and I'll show you how to make, essentially, a pipeline for creating your Spark models. First, let me quickly do a brief p-value approach to feature selection with a GLM: on the training data, `glm(Churn ~ ., data = rec %>% prep() %>% juice(), family = binomial)`, tidy it, and then filter on the p-value. So now we have our selected terms. We'll create a random forest model but use the GLM p-values to do the feature selection, even though obviously random forests can do that for us. So what do we need to change on churn_table? We want our contract year, so let's look at what contract looks like: `mutate(one_year = ifelse(Contract == "One year", 1, 0))`. In this case we have to create the dummy variables ourselves. I've looked a little at sparklyr's feature transformers: there's `ft_binarizer` (thresholds a column at greater versus less-than-or-equal) and `ft_bucketizer` (which does what `cut` does), but I don't think I
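The Spark connection and the p-value screening step could be sketched as follows (the 0.05 threshold is an assumption; the video just says to filter on the p-value; `rec` and `df` are assumed from earlier):

```r
# Connect to a local Spark instance; run spark_install() first if needed
sc <- spark_connect(master = "local")
churn_table <- copy_to(sc, df)

# Quick-and-dirty feature selection: keep terms with small GLM p-values
glm(Churn ~ ., data = rec %>% prep() %>% juice(), family = binomial) %>%
  broom::tidy() %>%
  filter(p.value < 0.05)
```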

### Segment 7 (30:00 - 35:00)

really see a dummy-variable transformer; maybe there's one, but these all look like stop-words-and-text transformers to me, so we'll just do it manually for now. We have our one-year contract dummy; we'll also do two-year, copying that: `ifelse(Contract == "Two year", 1, 0)`. Then paperless billing: finding the column, `ifelse(PaperlessBilling == "Yes", 1, 0)`. And lastly, payment method: `ifelse(PaymentMethod == "Electronic check", 1, 0)`. Then we have to select the columns: churn, senior citizen, tenure, total charges... we forgot multiple lines, so we add `ifelse(MultipleLines == "Yes", 1, 0)` too. So we have tenure and total charges, our multiple-lines dummy, contract one year, contract two year, paperless billing, and payment method electronic. (After chasing down a typo in the payment-method column name, the select works.) This is essentially the transformation the pipeline is going to do, because it's very easy to convert this dplyr syntax into SQL. I'll call the whole thing `spark_dplyr`, and we can convert this transformation, these features, into a transformer. If we call `ml_param` (the helper functions) asking for the statement, we can see the SQL it's creating.

Okay, now let's go on to creating our model pipeline. The transformation right here is essentially our recipe; now let's put it together and create something like a workflow. We'll say `ml_pipeline`, give it our Spark connection, then fit in our dplyr transformer with our table (that function we just built), then the R formula: I just copy the selected feature names in, give it a tilde, and do a quick search-and-replace to turn the list into a formula. Then we add the model: `ml_random_forest_classifier`, since this is Spark, and we'll call the whole thing our churn pipeline. If we look at the churn pipeline, it gives us a nice little workflow printout; a more complicated recipe, but that's our pipeline right there. What we
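The transformer-plus-pipeline construction described above, sketched with assumed Telco column names and the `sc`/`churn_table` objects from earlier:

```r
# dplyr transformation that Spark will compile down to SQL
spark_dplyr <- churn_table %>%
  mutate(
    multiple_lines_yes = ifelse(MultipleLines == "Yes", 1, 0),
    contract_one_year  = ifelse(Contract == "One year", 1, 0),
    contract_two_year  = ifelse(Contract == "Two year", 1, 0),
    paperless_yes      = ifelse(PaperlessBilling == "Yes", 1, 0),
    payment_electronic = ifelse(PaymentMethod == "Electronic check", 1, 0)
  ) %>%
  select(customerID, Churn, SeniorCitizen, tenure, TotalCharges,
         multiple_lines_yes, contract_one_year, contract_two_year,
         paperless_yes, payment_electronic)

# Pipeline = transformer (our "recipe") + formula + model; note the
# formula leaves customerID out so the ID never enters the model
churn_pipeline <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(spark_dplyr) %>%
  ft_r_formula(Churn ~ SeniorCitizen + tenure + TotalCharges +
                 multiple_lines_yes + contract_one_year +
                 contract_two_year + paperless_yes + payment_electronic) %>%
  ml_random_forest_classifier()
```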

### Segment 8 (35:00 - 40:00)

can actually start doing is creating our training and test splits. That's useful because, you know, why wouldn't you just query it with a `SELECT * FROM` the churn table and partition it that way? Because sometimes you can't store it in memory on our laptops, so often we want to do the splitting in the database too. We'll call it `spark_split` and use `sdf_random_split` (the `sdf` prefix marks the Spark data frame functions), give it our churn_table, with training at 0.8 and testing at 0.2. Now our spark_split object returns train and test tables, which is very nice. Finally, we can fit our model: we'll say `ml_fit`, give it our pipeline (our preprocessing, our dplyr transformations, the formula, the model we want to use) and then our data set, in this case the training table. So now it's fitting the model, and finally we can make our predictions on the test set: `spark_pred <- ml_transform(...)`, which essentially says: we have this table, and we're going to pass it through the pipeline and transform it into the thing we want out of it. So we give it our Spark model and then the testing table from our spark_split.

Now I want to select the customer ID and probability... let's just look at it first. It takes a little while to pop up, because there's a lot of stuff that goes into it. We have our churn and our probability, but, oops, I forgot to add the customer ID into the dplyr transformation, so obviously I want to put that into the select so it keeps the customer ID in there; we'll just redo that, which is fine. This is why splitting the transformation (our select statements) from the formula is useful: we can keep the customer ID in the table without putting it into the model, because that can cause some problems. So I re-run the transform, select the customer ID, and lastly use `collect`, which actually executes the query; I'll call the result `res`. We're basically saying: given our predictions, actually query them and get the results back as a data frame in memory. Ooh, and I forgot to select the prediction, so let me add that. Now we can see our churn, the result, the customer ID, the prediction, and the probability, which comes back as a pair of (no churn, churn) probabilities, so we can say `map_dbl(probability, pluck, 2)` to extract the churn probability. And right here we can see that when the prediction is 1, the churn probability is above 50 percent, meaning yes, probably churn; when it predicts no churn, it's a low probability, and so on.

So as you can see, it's actually very easy to create your models in Spark, given this kind of in-memory data set; in practice, you'll often have to do a bunch of joins first. I'm not going to go over deployment, because that depends on your system; you'd probably have to save the pipeline to a file. But what's really nice about these types of pipelines is that you can
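The split/fit/predict flow described above, as a sketch (assuming the `churn_table` and `churn_pipeline` objects from the earlier steps, with `customerID` already added to the transformer's select):

```r
# 80/20 split computed inside Spark rather than in R's memory
spark_split <- sdf_random_split(churn_table, training = 0.8, testing = 0.2)

# Fit the whole pipeline: transformer + formula + random forest
spark_model <- ml_fit(churn_pipeline, spark_split$training)

# Score the held-out table, then pull the results back into R
res <- ml_transform(spark_model, spark_split$testing) %>%
  select(customerID, Churn, prediction, probability) %>%
  collect() %>%                                       # executes the query
  mutate(churn_prob = map_dbl(probability, purrr::pluck, 2))
```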

### Segment 9 (40:00 - 41:00)

just run them on anything. On your setup you can refit pretty easily: just save this pipeline and fit it to whatever data set you have, maybe even a refreshed table, and schedule that off-hours. So again, sparklyr is a very straightforward package. I also personally really like using tidypredict: it's very fast, you can make adjustments, and it's very useful when you want to feed something like a Tableau dashboard.

So yeah, this will conclude the Tidy Tuesday series. I really enjoyed making these videos. The main reason I'm not making more regularly scheduled videos is that I think I've hit my ending: I've taught everything I wanted to teach, and this was basically the last thing. I figured I wouldn't overstay my welcome on YouTube, but I really hope you guys enjoyed all my videos. I'm still going to be somewhat active; I want to do more project-related videos, but those won't be on a consistent schedule. With that being said, I hope you guys had fun with my YouTube channel, and tidy on.
