Building recommendation systems is hard. In data science, we can spend months wrangling data, training models, and still end up with mediocre results. That's where Kumo AI comes in — it's a service that abstracts away the complexity of building Graph Neural Networks (GNNs) for predictive analytics.
In this guide, we'll build a complete e-commerce recommendation engine using real H&M data with 33 million transactions. By the end, we'll have a system that can:
- Predict customer lifetime value for the next 30 days
- Generate personalized product recommendations
- Forecast purchase behavior to identify active customers
All of this can be done in just a couple of hours - not months.
📌 Code:
https://github.com/aurelio-labs/cookbook/blob/main/recsys/ecommerce/kumo-hm/kumo-hm.ipynb
💡 Kumo AI: https://bit.ly/4gduL04
👾 Discord:
https://discord.gg/c5QtDB9RAP
Twitter: https://twitter.com/jamescalam
LinkedIn: https://www.linkedin.com/in/jamescalam/
#datascience #machinelearning #python
00:00 Kumo AI and GNNs
07:39 Kumo Setup
12:17 Kumo Connectors
14:45 Getting Data into BigQuery
20:39 Building the Graph in Kumo
28:34 Predictive Query Language (PQL)
35:01 Personalized Product Recommendations
38:44 Predicting Purchase Volume
41:44 Making Predictions with Kumo
47:10 Analysis and Prediction with Kumo
52:36 When to use Kumo
Kumo AI and GNNs
Today we're going to do a full end-to-end walkthrough of Kumo. As a quick introduction, Kumo is almost "data science as a service": it simplifies a lot of what we as data scientists would be doing in an analytics use case. It's best I give you an example. Let's say you're a data scientist at an e-commerce platform, and given your historical data, your goals are to: one, predict the lifetime value of a particular customer; two, generate personalized product recommendations for those customers; and three, forecast purchase behavior, i.e. in the next 30 days, what is this customer most likely to purchase, and in what quantities? As a data scientist, going and doing that yourself is a fairly complicated process that takes a fair bit of time, and the reason is the shape of this type of dataset. Let me show you what it might look like. You may have a customers table, a transactions table, and an articles (products) table. This is actually the data we're going to be using: the H&M e-commerce dataset, which has exactly these three tables. The customers table has 1.3 million records, so 1.3 million customers. You'll need to connect that customers data over to your transactions data, with the customer ID linking the two. In transactions you'll have the transaction date and the price of that transaction, which is pretty useful information when it comes to making all these predictions. On the other side, the transactions table doesn't necessarily hold the actual article or product information; that's stored over in the articles table, so you'd connect those two as well.
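To make the three-table layout concrete, here is a toy sketch in plain Python. Column names mirror the H&M dataset (`t_dat`, `customer_id`, `article_id`, `price`, `prod_name`), but the rows themselves are invented for illustration:

```python
# Toy versions of the three tables described above. Column names follow
# the H&M dataset; the values are made up.
customers = {"c1": {"age": 34, "club_member_status": "ACTIVE"}}
articles = {"a1": {"prod_name": "Slim Jeans", "product_type_name": "Trousers"}}
transactions = [
    {"t_dat": "2020-06-01", "customer_id": "c1", "article_id": "a1", "price": 29.99},
]

# A transaction row joins to both sides through its two foreign keys:
t = transactions[0]
enriched = {**t, **customers[t["customer_id"]], **articles[t["article_id"]]}
print(enriched["prod_name"], enriched["age"])  # -> Slim Jeans 34
```

This is exactly the join structure the graph will encode later: transactions sit in the middle, with customer ID pointing one way and article ID the other.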
In the articles table you'd have the product name, the type of product, the color, a natural language description. This dataset also includes images you can attach, although we're not going to use them, but there's a lot of information in there. So you're going to have something like this, and your job as a data scientist is to take this data, which is pretty huge, and transform it into business predictions that engineers, marketing, leadership, whoever, can then go and act upon. That can be a pretty hard task. When you have all these connections between many different tables, one of the best model architectures for the job is the graph neural network. The reason GNNs are good at this is precisely the relationships between different tables: they are very good at mapping out and understanding those relationships. You also get a better grasp of network effects, which in this scenario means how different customers' preferences may influence other customers' preferences, because these customers are not acting in isolation; they're all part of a broader world. GNNs can handle those network effects better than many other predictive models. You can also model temporal dynamics, which is a fancy way of saying you can model predictions over time. Around Christmas, for example, if you have data going back to previous Christmases, there's probably going to be a relatively obvious pickup in purchasing volume, but also in purchasing different things, especially when you're deciding what to recommend to a customer. It's summer: should I be recommending them a big fluffy coat for Arctic exploration? Probably not. But swim shorts, sunglasses, these sorts of things?
Probably, right? And a graph neural network, given enough data, will be able to do that. Another very common problem, and you'll see this across many disciplines, not just recommendation, is the cold start problem. The cold start problem is this: a new customer has just come in and you have no information about them, or very little. You might know that they're male or female, their age, their geography. What a graph neural network can do, based on that baseline bit of information, is ask: who are some other customers that seem similar? And based on that limited amount of information, it starts making predictions anyway. Maybe some will be wrong, but that's far better than just giving up, which is what the cold start problem otherwise means: not having enough information to give out even reasonable recommendations. These recommendations won't be as good as if we had more information about that customer, but with GNNs they're probably going to be better than with most other methods. So why do I even care about GNNs right now? Well, that's what Kumo is. Kumo is a service that abstracts away a lot of the complexity of getting our data, parsing it, doing data exploration, data preprocessing and cleansing. Kumo handles all that for us, and then it also handles training our graph neural network and making predictions with it.
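The cold-start intuition can be sketched in a few lines of plain Python: score a new customer against existing ones on the few attributes we do have, and borrow the nearest neighbor's history as a first recommendation signal. The attribute names and values here are invented, and a real GNN learns far richer similarity than this, but the shape of the idea is the same:

```python
# Known customers with a couple of baseline attributes (made-up values).
existing = {
    "c1": {"age": 25, "region": 3},
    "c2": {"age": 61, "region": 7},
}
new_customer = {"age": 27, "region": 3}  # cold-start user: no purchase history

def l1_distance(a, b):
    # Sum of absolute differences over the shared numeric attributes.
    return sum(abs(a[k] - b[k]) for k in a)

# The most similar existing customer becomes our recommendation proxy.
nearest = min(existing, key=lambda cid: l1_distance(new_customer, existing[cid]))
print(nearest)  # -> c1
```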
All together, that means something that might take a data scientist a month to work through, maybe more or less depending on the project and the competency level of course, can actually be done in a few hours. That's still some time, but in comparison it's pretty fast. And the level of expertise that has gone into building Kumo is pretty good: probably better than the average data scientist, and maybe even better than a pretty stellar data scientist. So the quality of predictions you're going to get out of this is probably better than trying to do it yourself in most cases. Maybe not all, but in many, and in any case it's so much faster that this seems like the way to go. This is particularly exciting for someone like me: I did data science at the start of my career, but I've mostly moved away from that toward more general engineering work, obviously a lot of AI engineering, and I'm not really in the model training and straight-up data science space anymore. So being able to make these sorts of predictions and integrate them with products or services I'm building is pretty cool. So,
Kumo Setup
enough of an introduction. Let's actually start putting together our end-to-end workflow for Kumo. We're going to work through this notebook, which you can run either locally or in Google Colab; I'll be running it locally, and you can find a link to the notebook in the comments below. Let's start by connecting to Kumo in the first place. You will need to set up an API key; there's a sign-up link in the comments below, or if you already have an account, you probably know where your API key is. For me, I can go into my Kumo workspace, go to Admin, and get my API key on the right. I'd have to reset and generate a new API key if needed; I'm not going to do that because I already have mine. Alternatively, at least the way I first got my API key was via email, so wherever you can find your API key, go for that. Once you have it, put it in the KUMO_API_KEY environment variable and the notebook will pull it in, or if you'd rather paste it straight into the notebook, it will prompt you via getpass. You should see the SDK initialize successfully. Coming down here, there are multiple ways of getting your data into Kumo. Generally speaking, with these sorts of projects you're going to be using a lot of data; it's usually a big data thing, so Kumo integrates with a few different data infrastructure providers. You can also just upload from local, but in this example we're going to use BigQuery. You can also use S3, Snowflake, Databricks, and I think a few others as well. I'm focusing on BigQuery just because I'm most comfortable with GCP, but it doesn't really matter as long as you have access to one of those. So, I'm going into my GCP console and over to BigQuery.
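The SDK initialization just described can be sketched roughly as below. The workspace URL is a placeholder (use the API URL from your own Kumo workspace), and the `kumoai` call names mirror the walkthrough but may differ between SDK versions:

```python
import os
import sys
from getpass import getpass

# Read the key from the environment, falling back to an interactive prompt.
api_key = os.getenv("KUMO_API_KEY")
if api_key is None and sys.stdin.isatty():
    api_key = getpass("Kumo API key: ")

if api_key:
    try:
        import kumoai as kumo

        # Placeholder URL: replace with your own workspace's API endpoint.
        kumo.init(url="https://demo.kumoai.cloud/api", api_key=api_key)
    except ImportError:
        pass  # kumoai not installed in this environment
```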
I've already created a couple of tables in here, but if you hadn't, you don't need to do anything here yet. You will need to go over to your IAM controls, though, because what we need to do is create a service account that gives Kumo access to our BigQuery. You can see the permissions we're going to need here: BigQuery Data Viewer, Filtered Data Viewer, Metadata Viewer, Read Session User, and Data Editor. We need all of these on our service account, so go ahead and create it with those roles. Well, I've already done it, so you can see my service account here with all those permissions. Once you've created your service account, and again, you can call it whatever you want, you don't need to call it Kumo or anything like that, that's just the name I gave it, go over to Service Accounts, select the account, and come over to Keys, where you'd add a new key. I've already created mine, so I'll just show you how you'd create a new one: choose JSON, click Create, and then save the file in the directory you're running this Kumo example from, and make sure you name it as shown here: kumo-gcp-creds.json. The reason we need to set up the access I've just described is fairly simple: we're using GCP and BigQuery over here as the source of truth for our data. All of our data, the customers data, transactions data, and articles data, is going into BigQuery, and Kumo needs to be able to read and write to it. So we set up our service account, which is the thing I've highlighted here, give its key to Kumo, and then Kumo can read the source data from GCP.
And when we make predictions later on, we'll be writing them to a table over in GCP. So that's why we need to make this connection. So the next thing we want to
Kumo Connectors
do after setting up those credentials is create our connector to BigQuery. There are a few items to set up here. We have the name for our connector, which I've set to kumo_intro_live. We have the project ID, that is, the GCP project ID; you can see mine up here, this Aurelio advocacy project, so just make sure that's aligned. And then we also set a dataset ID for the dataset we're going to read from and write to over in BigQuery. By the way, if you don't already have the dataset in there, we are going to go and create it, but right now you can see mine here: hm, the H&M dataset. So that's the setup: we read in our credentials file and use it to initialize our BigQuery connector with Kumo, via the kumoai SDK. Now, let me change the dataset ID to hm2 (and of course update it here too) to show what happens. When I come down here and try to view our tables, it's not going to let me: I run this and it throws an error, a 500, actually a 404, "not found: dataset hm2". I don't want to upload everything to BigQuery again because it takes a bit of time, so I'm just going to drop that and switch back to the dataset we created earlier. Now that I've connected to a dataset that actually has my data, running this doesn't throw an error; it just connects, no errors, because the data now exists. But of course, if you're following this through for the first time, you don't have that data already.
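As a sketch, the connector setup looks roughly like this. The names (connector name, dataset ID, credentials filename) are from the walkthrough, the project ID is a placeholder, and the exact `kumoai` constructor signature may differ between SDK versions, so treat this as a hedged outline rather than copy-paste code:

```python
import json

# Values from the walkthrough; project_id must match your own GCP project.
connector_config = {
    "name": "kumo_intro_live",
    "project_id": "my-gcp-project",        # placeholder GCP project ID
    "dataset_id": "hm",                    # BigQuery dataset holding the H&M tables
    "credentials_path": "kumo-gcp-creds.json",
}

try:
    import kumoai as kumo

    with open(connector_config["credentials_path"]) as f:
        credentials = json.load(f)  # the service-account key created earlier

    connector = kumo.BigQueryConnector(
        name=connector_config["name"],
        project_id=connector_config["project_id"],
        dataset_id=connector_config["dataset_id"],
        credentials=credentials,
    )
except (ImportError, FileNotFoundError):
    connector = None  # running outside the notebook environment
```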
So, let's take a look at how we get that data and then put it into
Getting Data into BigQuery
BigQuery. As I mentioned, we're going to use the H&M dataset, a real-world dataset with 33 million transactions, 1.3 million customers in the customers table, and over 100,000 products or articles. So it's a pretty big dataset, but that's similar to the sort of scale you might see yourself working at as a data scientist in this space; it's the sort of thing you'd see in a production recommendation system. So we'll come down here and start with where we download it. We're going to pull it from Kaggle; I think there was a copy of it on Hugging Face as well, but I don't want to use that because I don't think it's an official copy, so we'll use Kaggle. Now, to download data from Kaggle you do need an account, which is slightly annoying, but it's fine: just sign in or register, whichever. Once you've signed in, you should see something like this. What you need to do is go over to your settings, scroll down, and create a new API token. You can download this wherever you want, but I'd recommend downloading it into the directory you're running the notebook from, which is what I'll do. That downloads the kaggle.json file. Once you've done that, you want to move that kaggle.json into the location the Kaggle library reads from; on Mac that's ~/.kaggle/kaggle.json. Now when I import kaggle here, it reads my kaggle.json credentials from there and actually lets me authenticate. Otherwise it will throw an error, and you'll see that if you try to run this without setting it up correctly. For this specific dataset we'll also need to accept its terms and conditions, which we can do by just finding the competition page quickly.
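Once the token is in place and the competition terms are accepted (next step), the download-and-extract stage looks roughly like this. The competition slug is an assumption based on the competition's display name, and the calls are guarded so the sketch degrades gracefully without credentials:

```python
import os
import zipfile

# Slug is an assumption based on the competition's display name.
competition = "h-and-m-personalized-fashion-recommendations"

try:
    import kaggle  # importing reads ~/.kaggle/kaggle.json and authenticates

    kaggle.api.competition_download_files(competition, path=".")
except (ImportError, OSError):
    pass  # kaggle not installed, or credentials/terms not set up yet

# Extract only the CSVs from the downloaded archive; skip images etc.
zip_path = f"{competition}.zip"
if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".csv"):
                zf.extract(name)
```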
So, searching H&M under Datasets, or sorry, Competitions, we have this H&M Personalized Fashion Recommendations competition, which is the one we're working with. Somewhere around here there'll be a little notice telling you to accept the terms and conditions in order to use it; you need to go and click that. Once you've found and accepted it, you'll be able to download the competition files using this method here, pulling from the H&M Personalized Fashion Recommendations dataset. That can take quite a while to run, and I'm not going to download it myself because I already have it locally, but once you do, everything will be in a zip file that looks like this. We need to extract our data from that zip file, and we're only looking for the CSV files; there are a lot of other files in there, images and everything, but we're just after the CSVs, so we pull those out. If I run this bit, you can see the sort of data we have. There's a sample submission dataset that we're not interested in; we just want the first three: the customers, articles, and transactions_train CSVs. Now we need to take our data from our local device and push it into BigQuery, and we're going to do that directly with BigQuery. From Google we import bigquery and service_account, which is how we authenticate ourselves. We have our credentials file path, which is what we got earlier from the service account in GCP. We create a credentials object using the service account credentials and then use that to initialize our BigQuery client, again using the Aurelio advocacy project, the project within GCP that we're using. So we can run that, and then what we want to do is use our dataset ID, which we actually defined earlier.
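The client, dataset, and load steps just described, plus the create-if-missing logic that comes next, can be sketched as a helper around the `google-cloud-bigquery` client. The helper is defined but not executed here, since it needs live GCP credentials; the project ID would come from your own environment:

```python
# Mapping of BigQuery table names to the local CSV files we extracted.
csv_tables = {
    "customers": "customers.csv",
    "articles": "articles.csv",
    "transactions_train": "transactions_train.csv",
}

def load_hm_tables(client, dataset_id="hm"):
    """Create the dataset if it doesn't exist, then load each CSV into a table."""
    from google.cloud import bigquery
    from google.api_core.exceptions import NotFound

    dataset_ref = bigquery.DatasetReference(client.project, dataset_id)
    try:
        client.get_dataset(dataset_ref)              # raises NotFound if missing
    except NotFound:
        client.create_dataset(bigquery.Dataset(dataset_ref))

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the CSV header row
        autodetect=True,       # let BigQuery infer the schema
    )
    for table_name, csv_file in csv_tables.items():
        with open(csv_file, "rb") as f:
            job = client.load_table_from_file(
                f, dataset_ref.table(table_name), job_config=job_config
            )
        job.result()  # block until this load job finishes
```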
So we probably shouldn't define it again here. We use that dataset ID to create a dataset reference: in GCP, with an authenticated client, we're saying connect to this dataset object with this dataset ID. This works even if the dataset doesn't exist, because of what happens if it doesn't: we try to get the dataset, which throws an error if it's missing; we catch that error, note that the dataset doesn't exist, and then go ahead and create it, which is exactly what this code does. So I'm going to run this. For me, it's going to say the dataset already exists because I've already created it, but if this is your first time running this, your dataset will be empty and freshly created. With that done, you'd go through the files, the first three here: customers, articles, and transactions. This is essentially setting up each table and then pushing the data over to it. Again, that can take some time, so I'm not going to run it, but it will take a little while. Once we have that set
Building the Graph in Kumo
up, we need to move on to actually building our graph in Kumo. As we saw briefly before, this is what the dataset looks like, and based on that, this is how we're going to connect everything: our customers table connects via customer ID to the transactions table, and the transactions table connects to the articles table via the article ID. So let's go ahead and define these tables and relationships. First we can just check the tables we currently have. I have quite a few tables in here already; you won't see all of these. You should only have customers, articles, and transactions_train. All these other ones, like the prediction tables here, are generated by Kumo later; by the end you'll have most of those, but right now just the three. We want to connect to each of the source tables we should have, customers, articles, and transactions_train, and we do so like this: connector with a dictionary-style lookup on the table name. What that does is set up the source tables within Kumo. A source table is not a table inside Kumo; it's a table elsewhere, literally the source of your data. We can view these source tables through a pandas DataFrame-like interface: using head, we see the top five records of the table. This is the articles table, and you can see product code, product name, the number, the type, some more descriptive fields; there's quite a lot of useful information in here. Again, this is the articles dataset, roughly 100,000 records.
Now, looking at this, we should be able to see columns here. My bad. And we can see all the columns we're going to use. The very first one is article ID, so we know, or we'll see in a minute, that the articles can be connected via this to the transactions table. Let's come down and take a look at our customers table; we'll just look at the first two records here. In the customers table we have customer ID, which is what we'll use, plus some other bits of information, though here they're mainly empty. I'm pretty sure I've seen age populated in a few of those, so this is probably just a couple of sparse examples. Then, moving on to the transactions source: there isn't much information in the transactions table, but it's a lot of data. We can see the transaction date; the customer ID, which connects to the customers table; the article ID, which connects to the articles table; the price; and the sales channel ID. We're going to connect all of those up, and the way we do that is with the Kumo table abstraction. We initialize a Kumo table by pulling it from an existing source table, so we'll have the articles, customers, and transactions_train tables. The primary key for articles is article_id, and for customers it's customer_id. The transactions data doesn't actually have a primary key, but it does have a time column, the transaction date, so we highlight that instead. We can initialize all of that and then look at the result of infer_metadata, an automated step where Kumo looks at your table and infers the table schema for you.
So I can come down here and look at the metadata for my articles table, for example: article ID, product code, all the data types, whether a column is a primary key or something else; a lot of useful stuff in there. We'll also be able to see all of these tables in our Kumo dashboard. I have a few in here, so scrolling through, mine would be the most recent ones, from May: this one and this one. You can see some useful information for each of the columns. Let's go into articles, that would be interesting: if we look at, say, product name or product type name, you can see we have a lot of trousers. Trousers, sweaters, t-shirts, dresses, and so on. So there's some really useful information there; just go in and take a look through. Now, our tables currently exist in Kumo independently of one another, and we want to train our graph neural network on them, so we actually have to create a graph of connections between the tables. The way we do that is to initialize a Graph object: we set the tables that belong to this single graph, then define how they're connected using edges. Each edge has a source table, and transactions is the source for both of them: the transactions table connects via the customer_id foreign key to the customers table, and the second connection is the transactions table via the article_id foreign key to the articles destination table. So we run this, then run graph.validate to confirm this is an actual valid graph we're building. Everything went well there.
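Putting the table and graph definitions together, the step looks roughly like this. The `kumoai` call names mirror the walkthrough and the SDK's quickstart pattern, but exact signatures may vary by version, so this is a hedged sketch (the helper is defined, not executed, since it needs a live connector):

```python
# The two foreign-key relationships described above, as plain data.
edges = [
    {"src_table": "transactions_train", "fkey": "customer_id", "dst_table": "customers"},
    {"src_table": "transactions_train", "fkey": "article_id", "dst_table": "articles"},
]

def build_graph(connector):
    """Wrap the source tables as Kumo tables and link them into one graph."""
    import kumoai as kumo

    tables = {}
    for name, pkey in [("articles", "article_id"), ("customers", "customer_id")]:
        tables[name] = kumo.Table.from_source_table(connector[name], primary_key=pkey)
        tables[name].infer_metadata()  # let Kumo infer column types / semantics

    # Transactions have no primary key, only a time column.
    tables["transactions_train"] = kumo.Table.from_source_table(
        connector["transactions_train"], time_column="t_dat"
    )
    tables["transactions_train"].infer_metadata()

    graph = kumo.Graph(tables=tables, edges=edges)
    graph.validate()  # confirm the edges form a valid graph
    return graph
```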
And yet again we can go over into our Kumo dashboard; it would be my last one here, from May. I need to zoom out a little bit. It's pretty straightforward: we have the transactions data here with its transactions datetime column and these two foreign keys. The customer ID connects over to the primary key of the customers table, and the article ID foreign key connects to the primary key, article ID, of the articles table. So you can see those connections, and you can click through if you want to and see what's inside each table; we just did that, so I'm not going to do it again. Another thing you can do, if you want to visualize your graph within the notebook, is install a few additional packages and use the graphviz package to visualize it. I'm not going into that here, because for Linux versus Mac, and I assume Windows as well, you need to set it up in a different way. Since you can see the graph in the Kumo UI, I'll just do that personally; it's up to you. So, we've set everything
Predictive Query Language (PQL)
up, right? We're at that point now. It's almost as if we've been through the data cleansing, the data preprocessing, the data upload, all those steps we'd do as data scientists, and now we're getting ready to start training our model to make some predictions. There was quite a lot going on there, but nothing beyond what we would have to do anyway as data scientists; what we've done really simplifies quite a bit of work and condenses it into what we've just done. But now we need to get into the predictions. Kumo uses what they call the Predictive Query Language, or PQL. It's quite interesting: PQL, as you might guess, is a SQL-like syntax that lets you define your prediction parameters. Rather than writing some neural network training code, you write this SQL-like predictive query, and Kumo looks at it, understands it, and trains your GNN based on your PQL statement. Let's start with our first use case from the beginning of the video: predicting customer value over the next 30 days. The way we express that in a PQL statement is like this. There are a few different components in PQL. First there's the target, which is what follows the PREDICT keyword: we're saying predict whatever is defined as the target. So what is our target here? The sum of the transactions price over the next 0 to 30 days into the future, with days defined as the unit. That's our target: we're predicting the sum of the transaction prices over the next 30 days. Then we also have an entity: who or what are we making this prediction for? Here we say FOR EACH, so we get this sum broken down for each individual customer.
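Written out, the statement just described looks like this. The table and column names follow the walkthrough (`transactions_train.price`, `customers.customer_id`); PQL's `PREDICT ... FOR EACH ...` shape matches Kumo's documented syntax:

```python
# The customer-value query from the walkthrough, as a PQL string:
# predict each customer's total transaction value over the next 30 days.
pql = (
    "PREDICT SUM(transactions_train.price, 0, 30, days) "
    "FOR EACH customers.customer_id"
)
print(pql)
```

The `(0, 30, days)` window is the target's time range: from now (offset 0) to 30 days into the future.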
So what we get by writing this query is the value of each customer over the next 30 days. Let's go ahead and implement that. We come down here, use a predictive query, pass in our graph, and write the PQL statement I just showed you: predict the sum of transactions price for each customer, keyed on customer ID. We validate our PQL statement here, so let's run that. Great. Then we come down and ask Kumo to generate a model plan: essentially, "Kumo, based on our data, its volume, and the query we've specified, what is the ideal approach for training our model?" And you can look at the result. We can see that it's using mean absolute error, mean squared error, and root mean squared error as the training metrics, and the tuning metric is mean absolute error. There's network pruning, sampling, the optimization settings; it's using the Huber loss function to optimize for regression, and there's the number of epochs, the validation steps, test steps, the learning rates, weight decay... I'm not going to go through all of it, but there's a ton of stuff in here. We can actually see the graph neural network architecture, which is probably interesting for a few of us. So you have Kumo telling you what it's going to do and how it's generating your model. I'm not well versed in GNNs, but if you are, you can take a look at that and make sure everything makes sense according to your understanding. Of course, as I said, Kumo was literally co-founded by one of the co-authors of the GNN paper, so they have some pretty talented people working there, and those should be some pretty optimal parameters.
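The validate and model-plan steps, together with the non-blocking training kickoff described next, can be sketched as one helper. Again, the `kumoai` names follow the SDK's quickstart pattern but signatures may differ by version, so this is a hedged outline (defined, not executed, since it needs a live graph):

```python
def plan_and_train(graph, pql):
    """Validate the PQL, get Kumo's model plan, and kick off training."""
    import kumoai as kumo

    pquery = kumo.PredictiveQuery(graph=graph, query=pql)
    pquery.validate()
    model_plan = pquery.suggest_model_plan()  # Kumo's proposed training setup

    trainer = kumo.Trainer(model_plan)
    job = trainer.fit(
        graph=graph,
        train_table=pquery.generate_training_table(non_blocking=True),
        non_blocking=True,  # return as soon as Kumo confirms the job is queued
    )
    return job  # check job.status() or its tracking URL for progress
```

Because `fit` returns immediately with `non_blocking=True`, several of these jobs can run in parallel, which is exactly how the three use cases are trained side by side below.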
Once we're happy with that, we run the Trainer object: we create a kumoai Trainer with the model plan and then run trainer.fit. What this is going to do, let me run it, is initialize the training job. We'll see this cell finish quite quickly because we've set non_blocking equal to true: the request goes to Kumo, essentially saying "I want this training job to start running," and once Kumo confirms the job is running, control comes back to us and we can move on with whatever we're doing. But the training job itself won't be complete for quite a while; as I've been going through this the time has varied, but I'd say somewhere between 40 minutes and an hour per training job. You can, however, run multiple training jobs in parallel, and we have three predictions we'd like to make here, so I'm going to run all of them at the same time. We've got this one back now, so if you want to start running them all, just run the next few cells in the notebook, then come back and I'll talk you through what the other use cases' PQL statements are doing. We can check the status of our training job and see that it's running, and we can also click the tracking URL, which opens the training job so we can see how things are going in the UI if we want to. So,
Personalized Product Recommendations
coming back let's move on to the second use case which is these personalized product recommendations. This is one I personally like I would actually be very likely to use with a lot of the projects I currently work on which is obviously more like AI conversational AI uh building chat bots or just AI interfaces in general. The reason that I can see myself using this is let's say you have the H& M website. I don't know what's on the HM website but let's say they have a chatbot and you can log in. You could log in and you could uh talk to this chatbot or not even talk. It doesn't have to be a chatbot. It can just be that okay you log into the website and the website is going to surface you know some products that based on what we do here based on these personalized product recommendations that we build with Kumo it can surface those to the user as they log in so that you are providing them with what they want before they even they don't even need to go and search it's just it should just be there what they want they if possible right obviously you're not going to get perfect all the time. But you you'll probably be able to do pretty well with this. So, you can do that. You can also surface this, as I was originally going to say, through a chatbot like interface. You could tell a chatbot, hey, you have this customer and you're talking to them. Uh these are the sort of products that they seem most interested in. You know, kind of place those into the conversation when you're able to when it makes sense. So, that was another thing that you could do. There are many ways that you could use this. So this is a little more of a sophisticated prediction here. The reason I say this is a little more sophisticated is because we have a filter here. So we've added an additional um item to our target entity. So we have target entity and now we also have filter. So the target up here is pretty similar. Okay, the operator is different. So we're not summing anymore. 
for actually listing the distinct um articles that over the next 30 days we expect to appear in the customers uh like transactions table. Okay, so this is the top 10 and what it's saying by okay what are the top 10 uh these are like the top 10 predictions. So will the customer buy this or not? Okay, will this customer ID appear in the transactions table alongside a particular article ID in the next 30 days. That is what we're predicting here. And then we're filtering for the top 10 predictions because otherwise if we don't filter here, we're going to be looking at what was it like 1. 3 million customer IDs, unique customer IDs with against 100,000 products and we're making predictions for all of those, right? That would be that would that could be a larger number. Okay. So, what we're saying is, okay, just give me the top 10, like the top 10 most probable purchases for our customer. So, we would run that again. Same as before, nothing new here. So, we're just modifying the PQR statement. So, we run that, we validate it, and we're just going to check again. Okay, we get the model plan from Kumo here and then we'll just start training with purchase trainer fit. Okay, so that's going to run again as before. We will be able to check the status with this. Of course, we'll just have to wait for that other cell to run.
Predicting Purchase Volume
Now, final use case here. I want to look at predicting the purchase volume for our customers. So, in this scenario, it's kind of similar. So we're looking at account of transactions this time over the next 0 to 30 days. That generates our target. Again, we're looking for each customer. Okay. But we're adding a filter here. So what is this filter doing? This filter is looking at if you look here at the range, we've got minus 30 days up to zero days. So this is looking at 30 days in the past. past is saying okay let's just filter where the number of transactions so the count of transactions for each customer ID over the past 30 days is greater than zero. So what does that mean? That's saying just do this prediction for customers that have purchased something in the previous 30 days. What this does is it just reduces the scope of the number of predictions that we have to make by focusing only on active customers. So ideally we should be able to get a faster prediction out of this by you know within the data set naturally there's probably a lot of customers that are just inactive probably not going to get much information for but if we don't add that filter in we're still going to be making predictions for those customers. So you can do this ac across the other examples as well to just make sure we're focusing on like customers that we want to focus on. So as before, we're just setting up the predictive query, validating it. We do the model plan and then we fit that model plan. Same as before, no difference other than the query of course. Okay. And once that cell has finished up here, we can go here and just check the status of our jobs. I would expect them to either be running or cute as they are here. So, I'm going to go and leave these to run and this one as well and jump back in when they're ready and show you what we can actually do with these queries. Okay, so we're back and the jobs have now finished. So, we can see done in all these. 
So we can switch over to the browser as well and you can see in here that these are done. So training complete. This one took an hour 20 minutes. So it was pretty long to be fair but you can see yeah you can see the various ones here. The this one here 16 minutes pretty quick. I would imagine that is the one where we had the filter. Yeah. You see here we're filtering for the like active customers only and yeah the the sort of duration for that one is noticeably shorter. Uh which makes sense of course. So that's great.
Analysis and Prediction with Kumo
table. Okay. So we would run this, and then we just run graph validate to confirm that this is actually a valid graph that we're building. Okay, everything went well there. And again we can go over into our Kumo dashboard — it should be my last one here, which is May — and I need to zoom out a little bit. It's pretty straightforward: we have the transactions data here, with our transactions datetime column, and we have these two foreign keys. The customer ID connects over to the primary key of the customers table, and the article ID foreign key connects to the primary key of the articles table, which is article ID. So you can see those connections, and you can click through if you want to and see what is in those tables — we just did that, so I'm not going to do it again. Another thing you can do, if you want to visualize your graph within the notebook, is install a few additional packages and use the graphviz package to visualize it. I'm not going into that here because for Linux versus Mac (and I assume Windows as well) you need to set that up in different ways, and you can see the graph in the Kumo UI anyway — so I will just do that personally. It's up to you. So, we've set everything up. We're at the point now where we've been through the data cleansing, the data preprocessing, the data upload — all those steps we would go through as a data scientist — and we're getting ready to train our model to make some predictions. There is quite a lot going on there, but nothing beyond what we would have to do anyway as data scientists. So really we've simplified quite a bit of work and condensed it into what we've just done. But now we need to go into the predictions.
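Before we do, one aside: the graph we just defined is simple enough to sanity-check by hand. Here's a rough sketch in plain Python — this is not the kumoai API, just an illustration of the three tables and the two foreign-key links we set up (column names like transaction_date are placeholders):

```python
# Illustrative sketch of the three-table graph we defined in Kumo.
# Plain Python, not the kumoai SDK -- it just mirrors the schema.
schema = {
    "customers":    {"primary_key": "customer_id"},
    "articles":     {"primary_key": "article_id"},
    "transactions": {
        "time_column": "transaction_date",
        "foreign_keys": {              # column -> (table, primary key)
            "customer_id": ("customers", "customer_id"),
            "article_id":  ("articles", "article_id"),
        },
    },
}

def validate_graph(schema):
    """Check that every foreign key points at an existing table's primary key."""
    errors = []
    for table, spec in schema.items():
        for col, (ref_table, ref_pk) in spec.get("foreign_keys", {}).items():
            ref = schema.get(ref_table)
            if ref is None:
                errors.append(f"{table}.{col}: unknown table {ref_table}")
            elif ref.get("primary_key") != ref_pk:
                errors.append(f"{table}.{col}: {ref_table} has no primary key {ref_pk}")
    return errors

print(validate_graph(schema))  # [] -> both foreign keys link up cleanly
```

That empty error list is essentially what Kumo's own graph validation is confirming for us.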
So Kumo uses what they call the Predictive Query Language, or PQL. It's quite interesting: PQL, as you might guess, is a SQL-like syntax that allows you to define your prediction parameters. So rather than writing neural network training code, you write this SQL-like predictive query, and Kumo is going to look at it, understand it, and train your GNN based on that PQL statement. Let's start with our first use case that we described at the start, which is predicting customer value over the next 30 days. The way we do that in a PQL statement looks like this. There are a few different components in PQL. First we have our target, which is what follows the predict keyword — we're saying predict whatever is within our target. So what is our target here? It's the sum of the transactions price over the next 0 to 30 days into the future, and we also defined days as the unit here. So we're predicting the sum of the transactions price over the next 30 days. Then we also have an entity, which is who or what we're making this prediction for. Here we're saying for each, so we're getting that predicted sum broken down for each individual customer. What we get by writing this query is the value of each customer over the next 30 days. So let's go ahead and implement that. We come down here, we use a predictive query, we pass in our graph, and then we just write that PQL statement I showed you: predict the sum of transactions price for each customer, based on the customer ID. We validate our PQL statement here, so let's run that. Okay, that is great. Then we come down here and we ask Kumo to generate a model plan.
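Before looking at the model plan, it's worth pinning down exactly what that target means. The PQL string below is a paraphrase of the statement from the notebook, and the little function emulates — on made-up toy data — the kind of label Kumo would compute from historical transactions:

```python
from datetime import date, timedelta

# Paraphrase of the statement we give Kumo in the notebook:
PQL = "PREDICT SUM(transactions.price, 0, 30, days) FOR EACH customers.customer_id"

# What that target means: for each customer, sum their transaction prices
# in the 30 days after an anchor date. Toy data to make it concrete:
transactions = [
    {"customer_id": "a", "t": date(2020, 9, 5),  "price": 20.0},
    {"customer_id": "a", "t": date(2020, 9, 25), "price": 5.0},
    {"customer_id": "a", "t": date(2020, 11, 1), "price": 50.0},  # outside window
    {"customer_id": "b", "t": date(2020, 9, 10), "price": 12.5},
]

def target_sum(transactions, anchor, start_days=0, end_days=30):
    """Emulate SUM(transactions.price, 0, 30, days) for each customer at `anchor`."""
    lo, hi = anchor + timedelta(days=start_days), anchor + timedelta(days=end_days)
    out = {}
    for tx in transactions:
        if lo <= tx["t"] < hi:
            out[tx["customer_id"]] = out.get(tx["customer_id"], 0.0) + tx["price"]
    return out

print(target_sum(transactions, date(2020, 9, 1)))  # {'a': 25.0, 'b': 12.5}
```

Kumo computes labels like this over historical windows, trains the GNN to predict them, and then scores the window that starts today.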
Basically we're asking: okay Kumo, based on our data here, the volume of that data, and the query we've specified, what is the ideal approach for training our model? And you can look through this. We can see that we're using mean absolute error, mean squared error, and root mean squared error as the training loss metrics, and the tuning metric is mean absolute error. We have network pruning, the preprocessing, the sampling, and the optimization settings here — it's using the Huber loss function to optimize for regression. We have the number of epochs set here, then the validation steps, test steps, the learning rates, weight decay... I'm not going to go through all of this, but there is a ton of stuff in here. We can actually see the graph network architecture here too, which is probably interesting for a few of us. So you have Kumo telling you what it's going to do and how it's generating your model. I'm not well versed in GNNs, but if you are, you can take a look and make sure everything makes sense according to how you understand them. But of course, as I said, Kumo was literally co-founded by one of the co-authors of the GNN paper, so they have some pretty talented people working there, and those should be some pretty optimal parameters. So once we're happy with that, we're going to create this trainer object — the kumoai Trainer — with the model plan, and then we run it with trainer.fit. Okay, let me run this. What this is going to do is just initialize the training job. We're going to see this cell finish quite quickly because we've set non-blocking equal to true. So this is going to go to Kumo and say: okay, I want this training job to start running.
Once it has confirmation that the training job is running, it comes back to us and allows us to move on with whatever we're doing. But the training job will not be complete for quite a while — as I've been going through this the time has varied, but I'd say somewhere between 40 minutes and an hour for a training job here. You can run multiple training jobs in parallel, though, and we have three predictions we'd like to make, so I'm going to run all of those at the same time. Okay, we've got this one back now. If you want to start running these all now, just run the next few cells in the notebook and then come back, and I'll talk you through what the other use-case PQL statements are doing. We can check the status of the training job and see that it's running, and we can also click the tracking URL here, which opens up the training job so we can see how things are going in the UI if we want to.
Personalized Product Recommendations
Coming back, let's move on to the second use case, which is personalized product recommendations. This is one I personally like — I would actually be very likely to use it with a lot of the projects I currently work on, which are obviously more conversational AI: building chatbots or just AI interfaces in general. The reason I can see myself using this: let's say you have the H&M website. I don't know what's on the H&M website, but let's say they have a chatbot and you can log in and talk to it. Or not even talk — it doesn't have to be a chatbot.
It can just be that you log into the website, and the website surfaces some products — based on these personalized product recommendations that we build with Kumo — to the user as they log in. So you're providing them with what they want before they even need to go and search for it; it should just be there, where possible. Obviously you're not going to get it perfect all the time, but you'll probably be able to do pretty well with this. You can also surface this, as I was originally going to say, through a chatbot-like interface: you could tell a chatbot, hey, you're talking to this customer, and these are the sorts of products they seem most interested in — place those into the conversation when it makes sense. So that's another thing you could do; there are many ways you could use this. Now, this is a slightly more sophisticated prediction, and the reason I say that is because we have a filter here. We've added an additional item alongside our target and entity, so we have the target, the entity, and now we also have a filter. The target up here is pretty similar, but the operator is different: we're not summing anymore. We're actually listing the distinct articles that we expect to appear in the customer's transactions over the next 30 days, and taking the top 10 — the top 10 predictions. So will the customer buy this or not? Will this customer ID appear in the transactions table alongside a particular article ID in the next 30 days? That is what we're predicting here. And then we're filtering for the top 10 predictions, because if we don't filter here, we're going to be looking at 1.3 million unique customer IDs against 100,000 products and making predictions for all of those combinations — that could be a rather large number. So what we're saying is: just give me the top 10 most probable purchases for each customer. Then we run that again — same as before, nothing new here, we're just modifying the PQL statement. We run it, we validate it, we get the model plan from Kumo, and then we start training with the purchase trainer's fit. That's going to run again as before, and we'll be able to check the status with this — of course, we'll just have to wait for that other cell to run.
Predicting Purchase Volume
Now, the final use case. I want to look at predicting the purchase volume for our customers. In this scenario it's kind of similar: we're looking at a count of transactions this time, over the next 0 to 30 days. That generates our target, and again we're looking at this for each customer. But we're adding a filter here. So what is this filter doing? If you look at the range, we've got minus 30 days up to zero days, so this is looking at the 30 days in the past. It's saying: filter to where the count of transactions for each customer ID over the past 30 days is greater than zero. What does that mean? It means: only make this prediction for customers that have purchased something in the previous 30 days. What this does is reduce the scope of the number of predictions we have to make by focusing only on active customers. So ideally we should get a faster prediction out of this — within the dataset there are naturally a lot of customers that are just inactive, and we're probably not going to get much information for them, but if we don't add that filter in, we're still going to be making predictions for those customers.
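To make both of these concrete, here are paraphrases of the two PQL statements (the exact keywords are as I read them from the notebook — treat them as sketches), plus a toy emulation of what the active-customer filter and the top-10 ranking each do:

```python
from datetime import date, timedelta
import heapq

# Paraphrases of the two statements from the notebook (exact syntax may differ):
RECS_PQL   = ("PREDICT LIST_DISTINCT(transactions.article_id, 0, 30, days) "
              "RANK TOP 10 FOR EACH customers.customer_id")
VOLUME_PQL = ("PREDICT COUNT(transactions.*, 0, 30, days) "
              "FOR EACH customers.customer_id "
              "WHERE COUNT(transactions.*, -30, 0, days) > 0")

# The WHERE clause keeps only "active" customers: anyone with at least one
# transaction in the 30 days *before* the anchor date. Toy emulation:
def active_customers(transactions, anchor):
    lo, hi = anchor - timedelta(days=30), anchor
    return {tx["customer_id"] for tx in transactions if lo <= tx["t"] < hi}

transactions = [
    {"customer_id": "a", "t": date(2020, 9, 20)},   # active
    {"customer_id": "b", "t": date(2020, 6, 1)},    # lapsed -- filtered out
]
print(active_customers(transactions, date(2020, 10, 1)))  # {'a'}

# And RANK TOP K just keeps each customer's K highest-scoring articles:
scores = {"art1": 0.91, "art2": 0.12, "art3": 0.55}
print(heapq.nlargest(2, scores, key=scores.get))  # ['art1', 'art3']
```

So the filter shrinks the entity set before training and prediction, and the ranking caps how many article predictions come back per customer.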
You can do this across the other examples as well, to make sure we're focusing on the customers we actually want to focus on. So as before, we're just setting up the predictive query and validating it, we get the model plan, and then we fit that model plan — same as before, no difference other than the query, of course. Once that cell has finished up here, we can go here and check the status of our jobs; I would expect them to be either running or queued, as they are here. So I'm going to leave these to run — this one as well — and jump back in when they're ready to show you what we can actually do with these queries. Okay, so we're back and the jobs have now finished — we can see done on all of these. We can switch over to the browser as well, and you can see in here that these are done: training complete. This one took an hour and 20 minutes, so it was pretty long to be fair, but you can see the various ones here. This one here took 16 minutes — pretty quick. I would imagine that is the one where we had the filter. Yeah, you can see here we're filtering for the active customers only, and the duration for that one is noticeably shorter, which makes sense of course. So that's great. Now let's jump through and look at how we can use those predictions.
Making Predictions with Kumo
So to make predictions, we can come down to the next cell here. I've just added this check to confirm that the status is done — which it is; we've already checked, but just in case. Then we're going to use trainer predict. The first trainer, if we come up to where we actually initialized it, is this one here — the first PQL statement, predicting essentially the value of the customer over the next 30 days. So let's go ahead and run that prediction. What this is going to do is actually create a table in BigQuery — you can see I put the output table name here. So it's going to create this table.
Okay, so once we've run that, this table will have been created. Now, the other thing we should be aware of: for our second query, the one where we're ranking the top 10, that is a ranking prediction, which means we can have a varying number of predictions per entity. In that case we also need to include the number of classes to return. So we don't have to keep it at 10 like we said with top 10 before — we could change this to, say, 30 if we wanted to, but we're sticking with 10 here. It's just a nice parameter we can set if we want a different number of predictions returned. And if we come down to the next one, we have the transactions prediction, which is looking at the number of transactions for our customers over the next 30 days. So we run all of those, and then we can actually go in and see what we have in our data. The first one is the customer value predictions: of our customers, who will generate the most revenue for us? Now, when we specified the output table name before — Kumo will by default add "_predictions" to the end of that. So just be aware of that: it does change the table name. Then we can see this. It's worth noting that the head we have here is actually showing the lowest predictions, and this is a regression task. Essentially it's saying that for all of these customers, the sum of their transactions over the next 30 days will be zero — and because it's regression, it goes slightly into the negative, but essentially just view these as zero. So these are all the lowest predictions in the head here, and we actually want to reverse the ordering.
Kumo doesn't let us write something like tail the way you would with a pandas DataFrame, so instead we can use BigQuery directly to order by the largest values and just take the top five from there. That's what we're doing here: we write a SQL query selecting everything from this new table in our dataset and ordering by the target prediction — this number here — descending, so we have the highest values at the top. This is going to give us our predicted top five most valuable customers. So let's take a look. Okay, and we have these — much higher numbers now. The entity here is our customer ID, so we'll be able to use this to map back to the actual customers table pretty soon. So now we know who will most likely be our most valuable customers, which is a great thing to be able to predict. Next, let's have a look at what we think these people will buy. We're going to look at our purchase predictions — again, we can just go directly through BigQuery if we need to. Here we can see that this customer is very likely to buy this product here, with a pretty high score. Cool. So now let's have a look at transaction volume — we're just looking at each table quickly now, and we'll bring all of this together in a moment. The transactions table: how active will they be? Again the head shows very small numbers, so we use BigQuery to look at the largest values. These are transactions — this here would be the customer ID, and this is how many transactions we actually expect them to make in the next month: 20 transactions for the first one here. So all that is great, but how can we bring all this together?
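Because the output is just a table, getting that "tail" is one ORDER BY. Here's the same query shape run against an in-memory SQLite database standing in for BigQuery — the table and column names (ltv_predictions, entity, target_pred) are illustrative; your actual output table gets Kumo's _predictions suffix and its own column names:

```python
import sqlite3

# SQLite stand-in for the BigQuery predictions table Kumo wrote out.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ltv_predictions (entity TEXT, target_pred REAL)")
con.executemany(
    "INSERT INTO ltv_predictions VALUES (?, ?)",
    [("cust_a", 412.7), ("cust_b", -0.02), ("cust_c", 95.3), ("cust_d", 0.0)],
)

# The result head shows the *lowest* scores, so to get the most valuable
# customers we order descending -- the equivalent of the missing tail():
top = con.execute(
    "SELECT entity, target_pred FROM ltv_predictions "
    "ORDER BY target_pred DESC LIMIT 2"
).fetchall()
print(top)  # [('cust_a', 412.7), ('cust_c', 95.3)]
```

The same ORDER BY ... DESC LIMIT query runs unchanged in BigQuery's SQL dialect.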
What I want to do now is look at next month's most valuable customers and join that back to our actual customers table. I then want to see which actual products those customers are most likely to buy, and then, again focusing on those customers, see how many products we think they will buy. So let's do that. First, we're going to find next month's most valuable customers. How do we do this? To identify them, we do a join between our sum-of-transactions predictions table and our customers table. We join those two tables on the entity values from the predictions — which are just customer IDs — matching them to the actual customer IDs in the original customers table. We're also limiting this to the top 30 highest scores: you can see we're ordering by the target prediction and taking the top 30 of those, so basically filtering for the top 30 predicted most valuable customers. So I'll run that and see what we get. We can convert it to a data frame and just view that. Okay, we have the customer ID now — these are the top 30 predicted most valuable customers. We can come through here and see all the ages: the young twenty-somethings are the ones buying all the clothes, of course, along with this random 37-year-old over here. And then we can see their scores. So that looks pretty good. We can also see that they're all club members, whatever that means — I assume H&M must have some sort of membership club. So we have our top customers; we could use the top 30 here, but right now we're just looking at the top five most valuable customers. Now let's have a look at what those customers are going to buy. So come down to here.
We now need to join our customers table to the purchase predictions table based on the prediction entity — the customer ID. So the customer ID joins the customers table with the purchase predictions table. The purchase predictions table also includes — I think it was the class column — the article ID, or product ID. So we're going to connect those as well: we join the prediction's class to the articles table via the article ID, and that is going to give us our customers' most likely purchases. We're actually going to focus this down by filtering to a specific customer — our top-rated customer from the top customers table which we created here. So we're going to be looking at this person here. Let's run that and see what we get. Okay, we've got a few here. The customer ID is the same for all of these — we're looking at that single customer — and we can see what they're interested in buying. So, product name: magic — some magic dress. Then they have this Ellie good dress, another dress, some heeled sandals, a shirt, a dress again, some joggers, dress, dress... they really like dresses, I think, from the looks of it. So these are the products that, when this user next logs into the website, or if they're talking to an H&M chatbot, or when we're sending emails out, we should surface to that user: hey, look, here are some nice things we think you'd probably like. That was pretty cool. Now let's continue and take a look at the purchase volume, again for the valuable customers. Let me come back up and show you what that query actually is — the valuable customers come from this query here, finding our top 30 (was it top 30? Sorry — yes, top 30) most valuable customers. So we're going to be joining to that table here.
We're going to be joining our top 30 most valuable customers to the transaction predictions, and that is going to get us the predicted transaction volume for each one of those top 30 customers. So let's run that and take a look at what we get. Okay — so for our customers here, this is the expected transaction volume, and with that we've worked through our analysis.
When to use Kumo
We've gathered a lot of different pieces of information, and as I mentioned at the start, this is the sort of thing that, as a data scientist, would be hard to do — training the GNN yourself takes a lot of experience to do well. It's not impossible, but it is going to be hard to do properly. Kumo is really good at abstracting that away, which is really nice. The other thing I think is really cool here: if you're a data scientist, maybe you'd still want to go and do this yourself, although you would save a ton of time and probably get better results this way — it's up to you. But it also means that not only data scientists can do this. Especially for me, as a more generally scoped engineer, I want to build products, and I want to bring analytics into the data we're receiving in those products. Usually, for me to do that — okay, I can do some data science stuff, but one, it's going to take a long time, and two, the results probably won't be that great. With this, one, I will actually have the time to use Kumo to set up that analysis, and two, it will actually be a good analysis, unlike what it would probably be if I did it all myself. So I get very fast development and also world-class results, which is amazing. An incredible service, in my opinion. And this is just one of the services that Kumo offers — there is another one I will be looking into soon, which is their relational foundation model, or Kumo RFM. That's something I'm also quite excited about, so we'll be looking at that soon. But yeah, this has been the full introduction and walkthrough for building out your own data science pipeline for recommendations on this pretty cool e-commerce dataset. That's it for now. Thank you very much for watching — I hope all of this has been useful and interesting.
But for now, I will see you again in the next one. Thanks. Bye.