# Build Churn Training and Inference AI from Scratch - End to End Machine Learning Project - Part 1

## Metadata

- **Channel:** CodeWithYu
- **YouTube:** https://www.youtube.com/watch?v=Lca7UONLL9k
- **Date:** 10.11.2025
- **Duration:** 50:43
- **Views:** 2,017
- **Source:** https://ekstraktznaniy.ru/video/52951

## Description

In this course, I’ll show you how to build a scalable churn prediction system from scratch, perfect for developers, data scientists, and tech enthusiasts! Learn industry best practices for machine learning model creation, retraining, and deployment, while mastering tools like Apache Airflow, MLflow, Streamlit and modern ML workflows.

Full Video: https://youtu.be/gUJQGgZN8Ws

🔍 What’s Inside:
✅ End-to-End Architecture 
✅ ML Model Development (Feature engineering, training, and production promotion)
✅ Best Practices (Model retraining, performance optimisation, system scaling)
✅ Hands-On Walkthrough

Timestamps:
0:00 Introduction
1:17 System Architecture
4:00 Prerequisites and Installations
7:23 ML Services Dependencies
19:35 Setting up Training Dags
57:51 ML Pipeline for Improved Optimised Tuning
1:45:10 Streamlit Inference and Results Validation
1:53:55 Outro

📚 Resources
Full Source Code: https://buymeacoffee.com/yusuf.ganiyu/source-code-customer
For Youtube Members (Full source code 

## Transcript

### Introduction [0:00]

Hey guys, welcome back. In today's video, we're going to be diving into an exciting project. In the last couple of videos, we did sales forecasting. In this video, we'll be doing customer churn. These two go hand in hand. You can forecast what kind of sales you're going to have in the business, but it's also important to understand the customers you have in the business: how you are retaining them, and which customers have a tendency to leave the business or move to a competitor's product. In our case, this is going to be a real use case. We have a data export from an e-commerce provider, and the data has been anonymized and randomized as much as possible, so we can't identify any customer with it. Let's get into the architecture of the system and see what it looks like end to end. We get started with this export. The

### System Architecture [1:17]

export is going to be coming from Excel. We have an Excel file that has been exported by an e-commerce provider and randomized and anonymized as much as possible so as not to expose any customer PII. So in our case we have the Excel file. This can also be configured to come from Postgres or even be read directly from S3, which can be done easily with boto3 or, for Postgres, with a connector. Regardless of which one you decide on, I believe the simplest is to use the e-commerce data set from the Excel file, which is what we do in this video. Once we read that, we're going to use a project in our Airflow, which is the DAG. This DAG is going to read from the Excel file, and we are going to do data cleaning, feature engineering, and machine learning training. We're going to train two models in this video. One is logistic regression for the prediction, and we're going to do random forest as well. These two models will be put head to head and we select the best of the two, because random forest may not be best suited for every case. Sometimes logistic regression may be the best approach. So we have logic to intelligently switch between logistic regression and the random forest model that has just been trained. Based on this, we register the model in MLflow and promote it to production, but only once a certain threshold is met, I think about 70% or 75%, in terms of accuracy or ROC AUC.
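The head-to-head selection and the promotion gate described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the score dictionary, and the exact cutoff of 0.75 are hypothetical (the video only says "about 70% or 75%"), and the real project registers and promotes the model through MLflow rather than returning a name.

```python
from typing import Dict, Optional

# Assumed cutoff; the video mentions roughly 0.70 to 0.75.
PROMOTION_THRESHOLD = 0.75


def select_best_model(scores: Dict[str, float]) -> str:
    """Return the name of the model with the highest validation score."""
    return max(scores, key=scores.get)


def model_to_promote(scores: Dict[str, float],
                     threshold: float = PROMOTION_THRESHOLD) -> Optional[str]:
    """Return the best model's name if it clears the threshold, else None."""
    best = select_best_model(scores)
    return best if scores[best] >= threshold else None


# Hypothetical ROC AUC values, for illustration only.
example_scores = {"logistic_regression": 0.81, "random_forest": 0.78}
print(model_to_promote(example_scores))
```

A run whose best score falls below the cutoff returns `None`, which corresponds to "it's not going to get promoted into production".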
Once we are above 75%, we can promote this particular model to production, and this is done consistently on subsequent runs: if you train the model and the performance is really bad, it's not going to get promoted into production. This is how the MLflow and Airflow handshake happens. Once this is done and we have the model in production, we serve it through Streamlit, and in our Streamlit app we can run inference in real time against the model that has been registered in MLflow. Let's get into the system. I know this is a little bit too much theory, so let's get into the nitty-gritty and see how everything connects and how we stitch it all together at the end of the day to form a useful app. To get started, the first thing you need to do is get the Astro CLI installed. You can go to astronomer.io/docs/astro

### Prerequisites and Installations [4:00]

/cli/install-cli. Once that is done, you can verify it in your terminal by typing astro, and you should see some commands and help in there.

Now, let's get started with the project. I'm going to start by creating a new project, and I'll call it churn prediction. You can choose to use the latest version of Python, but to be safe I'll use 3.10. That's usually stable and most libraries are compatible with it, so there are no issues there. We'll use churn prediction as the name of the project. If you prefer to create a Git repository, you can do that. I'll just create that; you can always do it later. Once that is set up, wait for this to get bootstrapped and for the terminal to be set up properly. In here I'll just touch main.py, just to be able to increase the size of the text. I think this is fine, and I'll remove this.

Once that is done, don't forget we just installed Astro on our local system, so we can type astro to confirm that it's properly installed. When you see something like this, the Astro CLI, then it is properly installed on your local system. Whether you are using Windows, Mac, or any Linux distribution, you should be able to install this fairly quickly using the instructions on the page.

Now I'll start with astro dev init to initialize Apache Airflow in this root directory. I'll say yes, initialize the project here. And you can see the DAGs, the folders, and the configuration settings are properly set up for me. Good. So that's the first step. The second step is to go in here and start up Astro. You want to start it up quickly to be able to see if there are any issues. This is a bootstrap project, so it's not going to be a problem.
So you can say astro dev start, and this is going to download the images to your local system and start up Apache Airflow running locally. We wait for this to be built and we can get started. Okay, so the project has started up and you see what it looks like. This is using Airflow 3.0. You can see the example DAG, example astronauts. You can trigger this, but obviously we know it's going to work, so we don't need it. I can delete the example DAG in here and be rid of that. Alternatively, you can decide to rename the content and get things working yourself. All well and good, we're good to go with this. Now, the first thing we need to do once this is properly set up and we are ready to crack on is to get the Docker override in place so we can set up our MLflow, because we'll be training some models, as discussed earlier. So we need MLflow in here and some other

### ML Services Dependencies [7:23]

services that will help us facilitate the training. So let's get into that. I go in here and touch the Docker Compose override file. In this override I'm going to start with MLflow, which is the service that will be storing the model training runs, the artifacts, and the statistics. So I'm going to have services in here, and I'm going to have mlflow. The image is going to be Python 3.11, and I'll use the slim version. Then the platform is going to be Linux ARM64, not AMD64. For the environment, I've just set Git Python refresh to quiet.

The command to start our MLflow service is going to be bash, and then pip install mlflow. You don't necessarily have to specify the version for MLflow, but for simplicity's sake and to be safe, I'll do that. Otherwise, if you are installing the latest version of MLflow, let's say 3-something, and some services are no longer supported, then some parts of this project may not work. So I'm going to have psycopg2-binary. Then I'll install boto3, just in case we want to store our artifacts inside MinIO. Essentially, I guess that's all we need to install in here.

So let's make a directory: mkdir -p for the mlflow artifacts directory. This is where we're going to be storing all the artifacts. Then we do a chmod 777 on the mlflow artifacts directory. Then we can start our MLflow server, as simple as that: mlflow server. The host is going to be 0.0.0.0 and the port in this case is going to be 5000. Let's get our backend store URI in here, so it's going to be backend-store-uri. We change this to mlflow, mlflow, and the mlflow DB. And the default artifact root is going to be default-artifact-root. This is going to be a file path.
And I'll put this inside the mlflow artifacts directory. I think with that we should be ready to crack on. Wait, one more thing: let's serve artifacts. I think that should be all we need to do. Let's just break this down a little bit. Okay, good, that looks okay. The volumes are going to be mlflow artifacts in here. Then the ports, where we expose our MLflow. We're exposing it on 5001, so it's going to be 5001 mapped to the server's 5000. Let's wrap this in single quotes. Depends on is going to be the MLflow DB, because we need to get the DB in as well. Finally we have the networks, which are going to be airflow and default. When you start Astro it comes with default networks because it's already packaged, so we just need to leverage that particular network. For restart I just put unless stopped.

So that's our MLflow. Now let's set up the DB. It's going to be Postgres, so Postgres 16 Alpine; we're not interested in the full version. The platform is Linux ARM64. For the environment, the user is going to be mlflow, the password is going to be mlflow as well, and the Postgres DB is going to be mlflow. The volumes in our case are going to be the MLflow DB volume. Then the health check: let's have a test in here, CMD-SHELL, with pg_isready for the mlflow user. The interval is 5 seconds and the retries are five. Then we have our networks, which are going to be airflow and default. Then restart unless stopped. Good. Now let's get in our volumes and we're good to go.
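To make the shape of the override file easier to follow, here is a rough sketch of the structure dictated above, written as a Python dictionary and printed as JSON so the nesting can be compared with the YAML file on screen. It also includes the top-level volumes and the external Airflow network that the walkthrough adds right after this point. Several details are assumptions rather than facts from the video: the service name `mlflow-db`, the `/mlflow/artifacts` path, the Postgres data path, and the network name placeholder. The video pins an MLflow version that is not audible in the transcript, so no version is pinned here.

```python
import json

# Assumption: replace with the Airflow network name shown in your Docker dashboard.
AIRFLOW_NETWORK_NAME = "<your-project>_airflow"

# Shell command run by the mlflow container (paths are assumed, version left unpinned).
MLFLOW_COMMAND = (
    "pip install mlflow psycopg2-binary boto3 && "
    "mkdir -p /mlflow/artifacts && chmod 777 /mlflow/artifacts && "
    "mlflow server --host 0.0.0.0 --port 5000 "
    "--backend-store-uri postgresql://mlflow:mlflow@mlflow-db:5432/mlflow "
    "--default-artifact-root file:///mlflow/artifacts --serve-artifacts"
)

override = {
    "services": {
        "mlflow": {
            "image": "python:3.11-slim",
            "platform": "linux/arm64",
            "environment": {"GIT_PYTHON_REFRESH": "quiet"},
            "command": ["bash", "-c", MLFLOW_COMMAND],
            "volumes": ["mlflow-artifacts:/mlflow/artifacts"],
            "ports": ["5001:5000"],  # host 5001 -> MLflow server on 5000
            "depends_on": ["mlflow-db"],
            "networks": ["airflow", "default"],
            "restart": "unless-stopped",
        },
        "mlflow-db": {
            "image": "postgres:16-alpine",
            "platform": "linux/arm64",
            "environment": {
                "POSTGRES_USER": "mlflow",
                "POSTGRES_PASSWORD": "mlflow",
                "POSTGRES_DB": "mlflow",
            },
            "volumes": ["mlflow-db-volume:/var/lib/postgresql/data"],
            "healthcheck": {
                "test": ["CMD-SHELL", "pg_isready -U mlflow"],
                "interval": "5s",
                "retries": 5,
            },
            "networks": ["airflow", "default"],
            "restart": "unless-stopped",
        },
    },
    "volumes": {"mlflow-db-volume": {}, "mlflow-artifacts": {}},
    "networks": {"airflow": {"external": True, "name": AIRFLOW_NETWORK_NAME}},
}

print(json.dumps(override, indent=2))
```

With this mapping the MLflow UI would be reachable on port 5001 of the host, while containers on the shared network would talk to the server on port 5000.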
So the volumes are going to be the MLflow DB volume and MLflow artifacts, and for the networks in this case, airflow will be external, which is going to be true, and for default, false, is it? Let's get our name in here. If you take a look at your Docker dashboard, you'll see that you have a name like churn prediction underscore 4031ac, so I'm just going to use that. So that's the network, and we just have default after this. In here I'll put a name on it as well, ending in airflow. The reason is that we want to connect to the Airflow network that is set up by Astronomer. So make sure the name matches your project name, in this case churn prediction underscore 4031ac underscore airflow, and then default. And that's all you need to do.

You just need to start again, and then MLflow and the DB will both be started. You can see in here that our MLflow is getting started and the DB is there. Let's see what the issue is in here. It's saying it could not find a version that satisfies the requirement for the psycopg2-binary package. I mistyped psycopg; it should be psycopg2-binary. That should get it fixed. The project is starting up and it should be started now. Let's check again. Apparently I'm still... ah, this is psycopg, not the spelling I typed. Why do I keep making that mistake? psycopg2-binary. Okay, so it should be started now. Our MLflow is here and we're good to go. All the installations are in place and currently running, which means our installation and setup are fine, so we can get started with the rest of the system.

Now let's create a pipeline for training the model. I'm just going to create a file in here and call it churn_pipeline.py. Before we start writing any code, we want to make sure that the data we're going to be using is properly placed in the system.
So I'm just going to paste the data set in here, and this is the data set we're going to be working with. To show you what this looks like, I'm going to open it in Finder, or rather I'll open it with the associated application, and Excel should start up. Now you see what this data set looks like. This is real-life data extracted from a particular company. Here we have some columns, which are the variables and their descriptions. You can see the data set is e-commerce, and these are the variables we're going to be working with. This is the unique customer ID. When you see the churn flag set, that particular customer has actually churned. The tenure is the tenure of the customer in the organization, and the rest of the information keeps going.

Now this is what the data set looks like. In actual fact this has been redacted in some form, because there's no information about the customer, say the first name or last name or things like that. But we have the customer ID, which is good for our use case, so we are not really exposing any PII in here. The customer ID is here. Some customers have churned, and we can check this against the rest of the columns. So this is what we need to look at. If you are familiar with machine learning and how it works, you probably know that when you want to predict the outcome of these variables, you want to make sure that the outcome you currently have is removed from the data set the model sees. Once you are done, you can verify the output of the model by comparing what you had initially against what the model outputs, or you check the accuracy of the model after you've trained it.
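The idea of holding the known outcome back from the inputs can be shown with a tiny pandas example. The rows are made up and the column names are assumed spellings of the data set's headers; the point is only that the target is split off so it can be compared with predictions later.

```python
import pandas as pd

# Made-up rows; column names are assumed spellings of the data set's headers.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Tenure": [4, 12, 1, 30],
    "SatisfactionScore": [2, 5, 3, 4],
    "Churn": [1, 0, 1, 0],
})

TARGET = "Churn"
y = df[TARGET]                # known outcome, held back for scoring
X = df.drop(columns=[TARGET]) # inputs the model is allowed to see

print(list(X.columns))
print(y.tolist())
```

After training, predictions made from `X` would be compared against `y` to measure accuracy.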
But regardless, what this means is that if your model is saying this customer is going to churn, then based on that you can tell that this customer might churn in the future, and it's going to give you a percentage confidence as well, which is good. So let's get into the core meat of the system and see what that looks like. If you go back in here, this is what our data set looks like. We can just close it and get started. The first thing we want to do is get in our imports. So let's say import

### Setting up Training Dags [19:35]

sys, and we're going to say sys.path.insert, and it's not home, it's going to be /usr/local/airflow/dags. This is what we need in here, just to make sure that we are resetting the directory to the right location, so that when we read from files and so on, it reads properly. Then from airflow.decorators import task and dag. This is squiggling because we don't have Apache Airflow installed; if you install it, these IntelliSense errors will go away. Now from pendulum let's get our datetime: from pendulum import datetime. That should be fine. We're going to need pandas as well. If you prefer, you can use any other library you are comfortable with; there are other drop-in replacements for pandas that work fine as well. From typing import Dict, we're going to need this, and Any. Let's get our logging in place as well. That's all.

So let's get our DAG in. Our DAG is going to look like this. The start date is going to be a datetime. We need to install pendulum: pip install pendulum. So 2025, the month is 11, and today is the 8th. The schedule is going to be weekly; you can change this to daily if you prefer. Catchup is going to be false. Then the tags. These are not necessary, but they make things more readable and descriptive: churn prediction and, let's say, e-commerce. The description of this DAG is going to be production customer churn prediction pipeline with ML features. All right, good, and that's it. So we can get started with this. Our initial pipeline is going to be def churn_prediction_pipeline. For the first task we want to load the data from our Excel file.
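As a compact summary of the settings dictated above, the sketch below collects them as keyword arguments. To keep it runnable without an Airflow installation, the decorator usage is shown only in comments, and the start date is built with the standard library instead of pendulum, which is what the video uses. The insert position `0`, the tag spellings, and the `DAG_ARGS` name are assumptions.

```python
import sys
from datetime import datetime, timezone

# Make modules next to the DAG files importable inside the Airflow container.
# Inserting at position 0 is an assumption; the index is not stated in the video.
DAGS_DIR = "/usr/local/airflow/dags"
if DAGS_DIR not in sys.path:
    sys.path.insert(0, DAGS_DIR)

# DAG settings as dictated. The video uses pendulum's datetime; the standard
# library stands in here so the sketch has no third-party dependencies.
DAG_ARGS = {
    "start_date": datetime(2025, 11, 8, tzinfo=timezone.utc),
    "schedule": "@weekly",  # "@daily" is the alternative mentioned
    "catchup": False,
    "tags": ["churn-prediction", "ecommerce"],  # assumed tag spellings
    "description": "Production customer churn prediction pipeline with ML features",
}

# In the DAG file these would be handed to Airflow's decorator, roughly:
#
#   from airflow.decorators import dag, task
#
#   @dag(**DAG_ARGS)
#   def churn_prediction_pipeline():
#       ...  # tasks decorated with @task go here
#
#   churn_prediction_pipeline()

print(DAG_ARGS["schedule"], DAG_ARGS["catchup"])
```

Setting `catchup` to false means Airflow does not backfill runs for the weeks between the start date and the day the DAG is first enabled.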
So the task in here, let's call it load ecommerce data. We don't need a path in here, so it's going to be just empty. In fact we don't need this. Loading data from... we don't need that, that's really unnecessary. Then we have our Excel path. It's going to be our e-commerce data set, in /usr/local/airflow/dags. But not in dags; it should be in data, shouldn't it? And we have the e-commerce data set .xlsx file. So that's our Excel path. The sheet name that we need to work with is ecom, if you can still recall. Just to jog your memory, I'm going to open this in the associated application, and you can see that we have ecom as the worksheet we want to work with, not the data dictionary.

So we have ecom, and then logging.info is just to write something to say what we're trying to do: loading data from this Excel path. Then df is going to be read Excel with the sheet name. Then logging.info again: loaded, with the shape, df.shape. Good. Now the columns: info, columns, it's going to be df.columns, and let's change it to a list. Okay, good, that looks okay. Now let's accept this logging of the error loading data. Good.

So, hypothetically, we've read the file, we've logged the shape of the file and the columns in it, and then we can say let's clean the data. So df is going to be clean and impute data, and this is what you want to do: usually the first thing is to clean the data set and impute the respective data that will be used. I'll create a function in here and call it def clean and impute data. We're going to need df, which is going to be a pd DataFrame. Then we get NumPy in place: import numpy as np. It looks like we need to install NumPy. All right, so NumPy is in there.
So we say initial rows is going to be the length of our data frame. Then logging.info: starting data cleaning for this many rows. Next, let's get our missing summary. It's going to be df.isnull().sum(), so we get the total null values in this data set. Then the null percentage, which is going to be very useful when we want to see how we can impute the data. If a large share of a column is null, imputing may not be useful, but if it's, let's say, 2% or 1% null values, then you can impute with some strategy. So I'm just going to close this out and round it to two decimal places: multiply by 100, round to two. Okay, good. Now, for each column... should we use df directly? Let's say missing summary, and let's get the missing summary index. So if the missing summary is greater than zero, that's fine, let's just log something in here. logging.info: we say the column has this percentage missing.

Now the first thing we want to do is deal with the churn column. It's not going to be useful as an input when we are trying to do our prediction, because it's not a feature variable, it's the target output. So we say if churn is in df.columns, then df equals df dot drop, and we can use a subset. Our preferred subset in this case is going to be churn. All right, good. It's a target variable, so that's the number one thing we need to do. The comment says remove rows where target variable... is it rows? It should be column, shouldn't it? Remove column for target variable. All right, that's fine. Then number two, the next transformation we want to do is handle ID columns. We should remove the duplicates but keep the data.
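The missing-value summary described above can be sketched as follows, on made-up rows with assumed column names. One interpretive note: a drop call that takes a `subset` argument matches pandas' `dropna(subset=[...])`, which removes rows whose target is missing rather than the column itself. That reading is an assumption, but it is consistent with the churn column still being needed later in the same task.

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)

# Made-up rows for illustration; column names are assumed.
df = pd.DataFrame({
    "Churn": [1, 0, np.nan, 0],
    "Tenure": [4.0, np.nan, 1.0, 30.0],
    "OrderCount": [2.0, 1.0, 3.0, 5.0],
})

initial_rows = len(df)
missing_summary = df.isnull().sum()                         # null count per column
missing_pct = (missing_summary / len(df) * 100).round(2)    # share of nulls, in percent

for col in missing_summary.index:
    if missing_summary[col] > 0:
        logging.info("Column %s has %s missing (%s%%)",
                     col, missing_summary[col], missing_pct[col])

# Assumed reading of the dictated call: keep only rows whose target is known.
if "Churn" in df.columns:
    df = df.dropna(subset=["Churn"])

print(missing_pct.to_dict())
print(initial_rows, len(df))
```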
So if there's a duplicate in customer ID, we want to remove the duplicated records. Let's get the duplicated IDs first. Duplicate IDs is going to be df.duplicated with the subset, and we keep the first one. Then let's get the sum of all of them: how many records do we have? If the duplicate IDs count is greater than zero, then let's remove them. So df is going to be drop duplicates; the subset is customer ID and we keep the first record that exists. Then we log how many were dropped. Yeah, that's fine. That is number two.

Number three: we now have the numerical imputation strategies that we want to apply. This is left to you, based on the data set you're working with. In my case, let's get our numerical columns. We say select dtypes, including the number types, then the columns, then to list; let's convert it to a list. Again, if churn is in the numerical columns, let's remove it. Then, for each of the numerical columns, let's see what we have. If the column's null sum is greater than zero, that means we have at least one value that is null in this column. Then, if the column is in tenure, order count... in fact I'll put this side by side so I don't make a mistake. If it is in tenure, order count, coupon used, and number of device registered. If the column is any of those, we are going to be using the median. We use the median for count-like variables, and that would mean df column, fillna, median. So if you have any column that is one of these four, we use the median because they are usually counts.
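The duplicate handling in step two maps onto two pandas calls, `duplicated` and `drop_duplicates`. A minimal sketch on made-up rows, with `CustomerID` as the assumed column name:

```python
import pandas as pd

# Made-up rows; customer 102 appears twice.
df = pd.DataFrame({
    "CustomerID": [101, 102, 102, 103],
    "Tenure": [4, 12, 13, 30],
})

# Count rows that repeat an earlier CustomerID (the first occurrence is not counted).
duplicate_ids = df.duplicated(subset=["CustomerID"], keep="first").sum()

if duplicate_ids > 0:
    # Keep the first record per customer and drop the later repeats.
    df = df.drop_duplicates(subset=["CustomerID"], keep="first")

print(duplicate_ids, df["Tenure"].tolist())
```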
Otherwise, if it is not any of those four columns, let's say elif the column is satisfaction score, which I'll copy to avoid a typo. For satisfaction score we're going to be using the mode for the imputation. Not the mean; we use the mode, rather. This is wrong... if the mode is empty, else we use three. Okay. So we are saying: if it is satisfaction score, and satisfaction score looks like this, you have three, two, and so on, so you see 1, 2, 3, 4, 5. Whether we use the mode or the median doesn't matter much; what matters is that we impute sensibly. The safest is to fall back to something neutral when you are imputing something like a satisfaction score. So we're saying: if the mode is empty, use three; otherwise use the first mode value. So we say fillna. I'm just going to move this away from here and that should fix it. Okay, good. So that's for our satisfaction score.

Otherwise, if the column is in cashback amount, let's see what that looks like, or order amount hike from last year, we use zero. Use zero for monetary values; this is assuming zero means no cashback. The reason is that when you check the cashback values, we have a couple of records in here, and when you see zero, the assumption is there's no cashback, whereas any other value means there was cashback. Or no hike. So in our case, we fillna with zero. Then, if the column is day since last order, we use the median, but we cap it at a reasonable value.
So the median in our case is going to be the column's median, and then we cap it. We fillna with the median if the median is less than 10, otherwise 10. Alternatively, you can just use the median directly and you're good to go. Either way works for me; it's just to fill the data sensibly. For any other columns we just default to the median: median for everything else. Okay, good.

I know this is a long stretch, so let's quickly recap what we did in here. If the column is in tenure, order count, coupon used, or number of device registered, we use the median. If it is satisfaction score, we use the mode. If it is cashback amount or order amount hike from last year, we use zero as the monetary value. If it is day since last order, we use the capped median. Anything else is the median. If you look closely, we're repeating some logic in here, so we don't strictly need this branch and this one, but I'll leave it as it is. So that's our numerical part; I'll just change this log to say imputed numerical. Good. That's our first one, the numerical imputation strategy.

Now let's get our categorical column imputation strategy. Number four is going to be to get our categorical columns. We do that by selecting the dtypes of object, which is usually the way you get the columns that are not numerical. And if customer ID is in the categorical columns, let's remove it; we don't need that. Then we say for column in categorical columns, if df column's null sum is greater than zero.
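The numerical strategy recapped above can be written as one small function. This is a sketch, not the video's exact code: the column names are assumed spellings of the data set's headers, `impute_numeric` is a hypothetical name, and the cap of 10 follows the spoken "median if less than 10, otherwise 10".

```python
import numpy as np
import pandas as pd

# Assumed column spellings.
COUNT_COLS = ["Tenure", "OrderCount", "CouponUsed", "NumberOfDeviceRegistered"]
MONEY_COLS = ["CashbackAmount", "OrderAmountHikeFromlastYear"]


def impute_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing numeric values using a per-column strategy."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if "Churn" in numeric_cols:
        numeric_cols.remove("Churn")  # never impute the target

    for col in numeric_cols:
        if df[col].isnull().sum() == 0:
            continue
        if col in COUNT_COLS:
            fill = df[col].median()                  # count-like variables
        elif col == "SatisfactionScore":
            mode = df[col].mode()
            fill = mode.iloc[0] if not mode.empty else 3  # neutral fallback
        elif col in MONEY_COLS:
            fill = 0                                 # assume missing means none
        elif col == "DaySinceLastOrder":
            fill = min(df[col].median(), 10)         # median, capped at 10
        else:
            fill = df[col].median()                  # default for everything else
        df[col] = df[col].fillna(fill)
    return df


# Made-up rows with one gap per column.
sample = pd.DataFrame({
    "Tenure": [1.0, 3.0, np.nan, 5.0],
    "SatisfactionScore": [3.0, 3.0, 5.0, np.nan],
    "CashbackAmount": [120.0, np.nan, 0.0, 90.0],
    "DaySinceLastOrder": [2.0, 30.0, 40.0, np.nan],
})
print(impute_numeric(sample))
```

As noted in the recap, the count-like branch and the default branch do the same thing, so they could be merged; they are kept separate here to mirror the walkthrough.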
Then we need to identify the categorical columns that we want to handle, just like we did for the numerical columns. We start with the column being in preferred payment mode, preferred login device, and preferred order category. We use the mode for those because they are preference columns. So we say use mode for preference columns, and the mode value is going to be the df column's mode. Let's see... no, that's wrong. We say the first mode value if the mode is not empty, else unknown. That's fine. Then for this particular column we fill any NA value with the mode value. Now, elif the column is in gender or marital status, we use the mode for these demographic columns. Yeah, that's fine. For anything else we use the mode. Essentially what we're doing is using the mode throughout, but we want to make sure we take care of individual columns, if you get what I mean. So we say default to mode for everything else. Good. I suppose our categorical column imputation is complete.

Now let's do the final cleanup, after all the imputations are done. We only drop rows if they still have critical missing data. So check for any remaining missing values in critical columns. Our critical columns would be churn, tenure, and satisfaction score. If any of these critical columns is missing, we get the critical missing count, and if it's greater than zero we need to drop those rows. So before critical drop is going to be the length of df, then df drops the rows that are missing them, and then logging.info. Yeah, that's fine. Now, this should be number five, is it?
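Since every categorical branch above ends up using the mode, the categorical step and the critical-column cleanup can be sketched compactly. The function names and column spellings are assumptions, and the three branches from the walkthrough are collapsed into one because they behave identically.

```python
import numpy as np
import pandas as pd

CRITICAL_COLS = ["Churn", "Tenure", "SatisfactionScore"]  # assumed spellings


def impute_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing text values with the column mode, or 'Unknown' if there is none."""
    df = df.copy()
    cat_cols = df.select_dtypes(include=["object"]).columns.tolist()
    if "CustomerID" in cat_cols:
        cat_cols.remove("CustomerID")  # identifiers are not imputed
    for col in cat_cols:
        if df[col].isnull().sum() > 0:
            mode = df[col].mode()
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return df


def drop_critical_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that still lack a value in any critical column present in df."""
    present = [c for c in CRITICAL_COLS if c in df.columns]
    return df.dropna(subset=present)


# Made-up rows; the text column is given an explicit object dtype.
sample = pd.DataFrame({
    "Churn": [1.0, 0.0, np.nan, 0.0],
    "Tenure": [4.0, 12.0, 1.0, 30.0],
    "Gender": pd.Series(["Male", None, "Male", "Female"], dtype="object"),
})
cleaned = drop_critical_missing(impute_categorical(sample))
print(cleaned)
```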
Number five and number six. Before we continue, we say data validation and cleaning. These are just for outliers. Our outlier columns are going to be tenure, cashback amount, and order count. So for column in outlier columns, if the column is in the data frame, Q1 is going to be the 0.25 quantile, Q3 the 0.75 quantile, and IQR is going to be Q3 minus Q1. Then we can get our lower and upper bounds. The lower bound is Q1 minus 3 times IQR. Just to comment on this: we're using three times IQR in here to have a less aggressive outlier removal. And the upper bound is going to be Q3 plus three times IQR. Anything outside of this, we remove. So our outliers are going to be df column less than the lower bound or df column greater than the upper bound. If the outlier count is greater than zero, we log it, and then we remove them. What we're saying is that we specify the upper and lower bounds, going three times the IQR below Q1 and above Q3, so we set new bounds for what the data should look like. Anything outside of them is considered an outlier, and we remove those rows to avoid issues and imbalance in our data set.

Our final rows is going to be this, and our retention rate is going to be that, as a percentage. Then we can say data imputation complete, the retention rate is that, and that's it. We can just return df. That's all we need to do for our cleaning and imputation of data. Once that is done, we go back to where we came from, after the data is cleaned and imputed. Our final rows is going to be the length of df, so we can say data imputation complete: these are the final rows that remain after the cleaning is done.
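The outlier rule above is the standard interquartile-range fence with a multiplier of 3 instead of the more common 1.5. A minimal sketch, where `remove_outliers` and its parameter `k` are hypothetical names and the sample values are made up:

```python
import pandas as pd


def remove_outliers(df: pd.DataFrame, columns, k: float = 3.0) -> pd.DataFrame:
    """Drop rows falling outside [Q1 - k*IQR, Q3 + k*IQR] for each listed column."""
    df = df.copy()
    for col in columns:
        if col not in df.columns:
            continue
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outliers = (df[col] < lower) | (df[col] > upper)
        if outliers.sum() > 0:
            df = df[~outliers]
    return df


# Made-up values with one extreme entry.
sample = pd.DataFrame({"Tenure": [1, 2, 3, 4, 5, 6, 7, 8, 500]})
kept = remove_outliers(sample, ["Tenure"])
retention_rate = len(kept) / len(sample) * 100
print(len(kept), round(retention_rate, 2))
```

In this sample Q1 is 3 and Q3 is 7, so the fence runs from -9 to 19 and only the value 500 falls outside it.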
Now make sure in here that the churn column exists and is properly formatted, because this is going to be very useful for us. Churn is going to be zero or one, essentially. So we say churn is going to be df churn as type integer, not boolean. The churn rate is going to be the mean, and then we say logging.info: data set loaded, and we have the churn rate. Good. Else we say logging.info: no churn column found in data set, and then we return an error: no churn column found in data set. Good. Yeah, that's fine.

Now let's convert this into a serializable format so we can properly serialize it. We say our data is df.to_dict, and we orient by records. The number of customers is going to be the length of df, and the churn rate is going to be that; let's convert it to float. Okay, good. I suppose that's it, really. Let's raise an error in here and call this a runtime error: failed to load data. So that's our first task, and right now it looks like we are rocking and rolling. Good. Now that our first task is properly done, we can minimize that. Is there an error in here? Missing percentage is that. It looks like we're not using... yeah, that's fine, I suppose we are okay with that. Okay. Now the next one is going to be to validate.
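The end of the load task, formatting the target and returning something serializable, can be sketched like this. The function name `to_task_payload` and the dictionary keys are assumptions; the video only states that the data is converted with `to_dict` oriented by records, together with the customer count and a float churn rate.

```python
import pandas as pd


def to_task_payload(df: pd.DataFrame) -> dict:
    """Return a plain-dict summary of the cleaned data, or an error marker."""
    if "Churn" not in df.columns:
        return {"error": "No Churn column found in dataset"}
    df = df.copy()
    df["Churn"] = df["Churn"].astype(int)  # 0/1 integers rather than booleans
    return {
        "data": df.to_dict(orient="records"),
        "n_customers": len(df),
        "churn_rate": float(df["Churn"].mean()),
    }


# Made-up rows with a boolean target.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Churn": [True, False, False, True],
})
payload = to_task_payload(sample)
print(payload["n_customers"], payload["churn_rate"])
```

Returning plain dictionaries, integers, and floats matters because values passed between Airflow tasks have to be serialized.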
