# Learn ETL Pipelines in Databricks in Under 1 Hour | Data Engineering in Databricks

## Metadata

- **Channel:** Alex The Analyst
- **YouTube:** https://www.youtube.com/watch?v=Vht4hoRHEek
- **Date:** 28.04.2026
- **Duration:** 1:04:56
- **Views:** 6,986
- **Source:** https://ekstraktznaniy.ru/video/49807

## Description

In this series we are going to dive into the Data Engineering side of Databricks!

This video covers orchestrating jobs to automate our data pipelines.

Get the Data Here: https://github.com/AlexTheAnalyst/DatabricksSeries/tree/main/Data%20Engineering

Try out Databricks Free: http://signup.databricks.com/?provider=DB_FREE_TIER&utm_source=youtube&utm_medium=video&utm_campaign=AlextheAnalystDE

____________________________________________ 

RESOURCES:

💻Analyst Builder - https://www.analystbuilder.com/

📖Take my Full MySQL Course Here: https://bit.ly/3tqOipr
📖Take my Full Python Course Here: https://bit.ly/48O581R
📖Practice Technical Interview Questions: https://bit.ly/46pDqqL

Coursera Courses:
Google Data Analyst Certification: https://coursera.pxf.io/5bBd62
Data Analysis with Python - https://coursera.pxf.io/BXY3Wy
IBM Data Analysis Specialization - https://coursera.pxf.io/AoYOdR
Tableau Data Visualization - https://coursera.pxf.io/MXYqaN

*Please note I may earn a small commission for

## Transcript

### Intro [0:00]

What's going on everybody? Welcome back to another video. Today, we're going to build ETL pipelines in Databricks in under 1 hour. Now, in this video, we're going to cover several different things. First, we're going to work on data ingestion, just getting data in. Second, we're going to actually build out our ETL pipelines. And then third, we're going to work on data orchestration, or creating jobs in Databricks. At the very end, we're going to have a full end-to-end project where we pull data in from a folder in an AWS S3 bucket. Then we automate it with an ETL pipeline to clean that data. Now, this video is made from several shorter videos that we have done on Databricks in previous lessons, but we're putting them all into one long video, so you can watch it all at one time. Let's not waste any more time. Let's jump into the first part, which is data ingestion.

### What is ELT [0:48]

Before we jump into this data engineering series and start doing all the things, I want to slow down for just a second to take a look at what ELT is in Databricks. This is the process that we're going to be walking through for this entire series. It starts with extracting data, or getting our data into Databricks. That involves loading the data into different schemas and having that data available. And then we transform that data. Now, ELT might sound odd because most people are used to ETL, where you extract data, you transform it, and then you load it into the database. With a lot of modern data workflows, it doesn't actually make much sense to transform your data before you load it, because compute is quite cheap these days. And so, you can just load your data into Databricks and then transform it after. Now, there is something called the medallion architecture. We're going to take a look more at that in the next lesson when we take a look at bronze, silver, and gold architecture. Now, this is a really great way to kind of stage your data, and it's been like this for a long time, even before Databricks. But, we'll be going into why and how we actually do that within Databricks. Our data, when we actually get it into Databricks, is being stored in a Delta table. It's kind of like a Delta file type, which is basically just Parquet files with a log system on top, so you can revert back and see previous changes to the actual table. And so, we store our data in these Delta tables, and then we can do all of our transformations on it within a notebook or within SQL queries. Now that we've got that out of the way, let's actually jump into Databricks and see how we can

### Data Ingestion [2:10]

do this. All right, here we are on Databricks, and we're going to be doing two things. One, we're just going to upload a CSV file. It's probably the simplest way to get data into Databricks. But then we're also going to connect to an S3 bucket. And so, I'm going to show you how you can do that really easily. And we're going to get all of our data into Databricks. Now, we are just working with sample data for this lesson. But at the last one when we start doing our full ETL process and automating this entire thing, then we'll be using real data. And so, it'll be a lot more and a little bit more complex. There's a few ways to ingest data. One, you can just click on this bring in data, and it's going to take you right down here. But we can also just go to our data ingestion. And so, when we click on this, we're going to upload files to a volume or we're going to create and modify a table. Now, these are two separate things, and these are important things to understand. Let's actually come over here to catalog for a second. And what we're going to do is we're going to come over here, and we're going to create a new catalog. And we're just going to call this one our data engineering uh that's all we're going to call it. I was going to keep going, but we'll call it the data engineering one. And let's go ahead and view this catalog. Now, when we create a schema, we're going to say this is uh video one. Let's go ahead and create this. We have within our data engineering, we have our default, and then we have this information schema, but we also have this video one. Now, we don't have any data in this schema. But what we can do is we can create different ways to store our data. We can store it in a volume, or table. Now, in a previous series, I kind of dove into these and how you can store your data as well as how to access the data once you put it in. We're going to be putting all of our data into tables, so I'm just going to set up one table. 
Now that we're here though, we can do the same thing that we would do if we came over to our data ingestion, which is basically just drop a file in here like you would on any platform. Let's just go over to the data ingestion just so we get the full experience. We're going to create or modify a table. We're going to select our CSV file, so it's just our users_dirty. I'm going to go ahead and upload this. So now we have this preview of our data, and we're going to specify what we want to do with it. We could create a table, we could overwrite an existing table. And we want to put this in our data engineering video one. So if it doesn't automatically populate, you can always just specify where you want to place it. And we're going to call this _CSV because we're going to be bringing in the same file from an S3 bucket. So I just want to specify where we got this data. Let's come down here, and we're going to create our table. So now we have our data sitting in our video one schema. So this is our users_dirty_CSV. This is the simplest way to get data into Databricks. But I also have this exact same data sitting right over here in an S3 bucket. And I want to use it. I want to connect to this data. I want to pull it in automatically. And that's going to really help us later on down the line when we start automating this whole process because we're going to create a connection to this data source so we can automatically pull this data in. And that's a big part of just data engineering in general, which is creating systems that can automatically ingest, transform, and load your data. So what we're going to do is we're going to come right over here. Now, I just want to show you this. I'm going to have a link down below so that you can see this as well. But this is basically just how you're going to create the connection. I'm going to show it to you in a second. It is very, very simple. So, let's come right over here. 
And what we're going to do is we're going to come down to data ingestion. Now, we want to go to the Databricks connectors, and our Amazon S3 bucket right here. And we need to create an external location. So we're basically connecting our Databricks account to our Amazon S3 bucket. And then we can bring that data in very easily. What we're going to be using is this AWS quick start. Let's go ahead and select next. We need to put in our bucket name. So I'm going to come over here. Let's click in here. We can actually get it right here. There's other places to get it, but I'm just going to copy it from here. Uh we're going to go back to our catalog explorer, and there's our bucket name. So now, what we're going to do is we're going to generate this new token, and we're going to copy this. Now, we're going to come over here to launch in quick start. And all we have to do is it's going to connect to our account, so that makes it pretty easy if you're already logged in. Then we're going to come down here, and we're going to say I acknowledge, create stack. It's going to come right here, and it's going to say create in progress. It's just going to validate it for a second. I had already done this uh before when I was making this video earlier just to confirm that everything was working smoothly. And what it's going to do is it's just going to say create complete, and then you're going to be good to go. All right, so that took about 2 minutes, and it says it is complete. So all we're going to do is come back here, and we're going to refresh this page. So I'm going to go ahead and refresh. And now that connection is active, we now have access to our users_dirty. Let's come over here. We're going to click on this, and we're going to go to preview table. So now we get this preview of the exact same data set. There's nothing changed. I'm not trying to trick you. All we have to do is we're going to come up here to data engineering, and we're going to go to video one. 
And then we need to name this. So I'm going to call this one dirty_data_S3. So I'm just naming it this purely so that we know which one came from which. Let's come down here to create table. And now we can see over here we have our dirty data_S3 and our users_dirty_CSV. I named it completely wrong. But these are the exact same data sets, and now we have them in from two separate locations. Now, this is really important, especially as we start automating a lot of this. If you have data that's sitting in an S3 bucket, and you have other systems that then upload it into that bucket, we're going to be able to ingest that data automatically, whether it's updated or if it's a new file, and we'll set all sorts of triggers and schedules and all sorts of really cool things in later lessons. So that's how we ingest data within Databricks. In our next lesson, we're going to be transforming data within an actual data pipeline. So we're going to have the entire ingestion process as well as the transformation process all in one place. Now, in the last lesson

### Building ETL Pipelines [8:25]

we worked on data ingestion into Databricks. So we were able to connect to just a local file, just kind of reading that file in. And then we were also able to connect to an AWS S3 bucket. Now that we have that data pulled in, and we have it actually sitting in our schema, we need to clean this data up a little bit. And so, we're going to need to transform this data, which is part of the extract, transform, and load within an ETL process. So we extracted it, and we ingested that data, and now we need to clean up our data cuz it is messy. So that's the transformation piece of the ETL process. Once we have this done, we can put it into an ETL pipeline, and then it sits there, and it does a lot of the heavy lifting for us. And we'll talk about that in this lesson. Now, really quickly, before we jump into things, I want to talk about this bronze, silver, and gold medallion architecture that is very popular within Databricks. Now, we've actually already covered this bronze level, which is just our raw data. We ingested our data from our S3 bucket, and it's just sitting there in this raw format. This is data that we are just never going to touch. What we're going to do is we're going to create transformations on that data, and then we're going to put it into a different table or even a different schema or catalog. When it gets to that and the data is actually changed, that's going to be in our silver. So the silver layer or architecture is basically just once you clean it up, and you have it in a lot better state where there aren't a lot of duplicates, issues with the data, that's where it's going to sit where you can then transform it into your gold architecture or layer. Gold is just production ready. You already start using this data. You're going to put it into dashboards. reports. You're going to put it into your apps, whatever you're using that data for. 
Back when I was just using Microsoft SQL Server or any other tool, we would call this raw, staging, and production. The raw is bronze, the staging is silver, and of course the production is gold where we actually use that data. So, what we're going to do in this lesson is we're going to actually use this. We already have our bronze, we need to transform our data into silver, and then find a business use case to create the gold table. So, now that we have this background information, let's go on to our screen and start building this out. So, in the last lesson, we brought in this dirty_data_s3. And this is what our data looks like. We have this user ID, first name, last name, their email, their sign-up date, the country, and referral source. Now, this is just our raw data. This is our bronze layer right here. Now, in this video, we're not going to do it exactly how I would do it in the real world. I'm just going to kind of keep it all in one place for us. So, within this video_1 or within, you know, whatever schema you created, we're going to keep our silver and our gold tables all within this one schema. That is not typically how it is done. Here's what you typically would do. You're going to have a data engineering_bronze catalog, then a data engineering_silver catalog and a data engineering_gold catalog, and all these catalogs would hold the different levels. And so, you're not just usually working with one small project like we are in, you know, this lesson, but typically you're working with lots of different projects and customers, and you want those to be separated out so you don't kind of get them confused and end up not knowing which data you're supposed to be hitting off of. That typically is how it's done in a real workplace environment. We're just going to do it all right now within this video_1 schema. So, this is our bronze layer right here. This is the file that we are going to be using. Now, in order to transform our data to get it to silver, here's what we need to do. 
Let's come up to New, and let's go down to Notebook. So, we have this notebook right here. Let's call this bronze to silver transformation. There we go. And I'm going to give you a little spoiler here. We're going to create another notebook, and we're going to call this one silver to gold. And so, we want to separate these out. You don't have to separate these out, but for the sake of what we're going to be doing in this lesson, I do want to show you how you set things up and how you actually, you know, organize things within an ETL pipeline. And then in the next lesson, when we look at jobs and orchestration and automation, this will also come into play, and I'll talk all about that. So, let's create these two different things. Now, I can write all this out because, you know, I know this data set. It's pretty simple, and I already know what's wrong with it. So, I can go in and I can just fix it. I can write this out manually, but you know, let's get a little creative. Let's take a look at how we can use AI in order to see if it can do most of the heavy lifting for us. Now, in our sample data, I'm going to give you two things that need to be changed cuz there's only really two big issues. The first thing is in this date column, we actually have it as a string, and that's a problem, right? We need it to be a date column, and the issue is this right here. We have one date field that is 2.29.24, with periods instead of the forward slashes. That's an issue. We also have a usr_1009 as a user ID, and if we go down, we have a 1009 right over here. So, we have a duplicate user ID. And in a primary key, like a user ID typically would be, that's an issue. So, we have two issues we need to solve. I am going to try to get the AI, which is the Agentic AI, which is this one up here that we're going to be using, to try to write this out and get it right. So, let's come back to our workspace. Let's come to our bronze to silver transformation, and let's bring up our AI assistant. 
Now, I'm going to describe what I want it to do, and then we're going to see if it's able to write it out. I myself could write it out very accurately in probably maybe 3 to 4 minutes, but this is not a coding tutorial. I want to show you guys how ETL pipelines work in Databricks, not how to necessarily transform the data. So, let's try this out. So, I'm going to say, "Take my data set," and I can pull it up over here just so we can see it. I'm going to go to data engineering, video_1, just so I can see the data. "Take my data set in the data engineering catalog in video_1 schema called dirty_data_s3." Now, I like to be super explicit cuz I don't want there to be confusion. Especially as you have like hundreds of tables, you don't want it to read in the wrong tables. I like to be super explicit. I'm going to say, "There is something wrong in the date column making it a string. I want you to identify and fix that issue. There's also duplicates in the data set. I want you to remove duplicates on the ID." Now, I'm being slightly vague, right? I'm not telling it exactly what it needs to do, but I'm going to let this run, and we're going to see if it's able to identify the issues and write the code. I do want it in Python. I think that's just the easiest way to transform this data. And so, I'm going to say, "Use Python and pandas." And let's give it a go. So, let's let this thing run for just a little bit, and we'll see what it comes up with. It took about a minute or so. It did a lot of different things, and now it wants to actually run this code. Now, before we do that, you can have it ask every time, or you can just allow it to run the code after it's done. I'm going to ask it to ask every time just because, you know, I want to make sure. Now, it does have a lot of printing just to show the work that it's doing. I myself don't want this in my output, so I will ask it to change that in just a second. 
But, it does identify that there's a period, it replaced it with a forward slash, it converts it with to_datetime, which looks correct, and it also formats it for us. Then it comes down here, and it's doing just a ton of kind of pretty unnecessary things before it gets to this df.drop_duplicates on the user ID, and we're keeping the first one, which is perfectly fine. And then lastly, it's doing a lot of verification. I basically don't want 80% of this code. I just want the simple stuff. So, all I'm going to say is, "I like the transformations you've done, but get rid of all the print statements." All right, it looks like it's done. As you can see, it cleaned up the code immensely. This is really looking good. I'm going to go ahead and I'm going to accept all. You can see the diffs down here, by the way, for all the code that it's writing or taking away. We're going to accept all, and we are going to run this ourselves. We can just click run all here. But, we're going to run this ourselves, and then we'll verify and make sure that this actually looks good. So, let's open this up. Let's come down here, and let's just do a display. We'll do data frame_clean, which is what it named it. So, now let's look at this new data frame that it has created. That should be a lot cleaner than before. So, now if we come down here, we have our 1009. Let's go see if our duplicate 1009 was removed. It was. And let's come over here to our sign-up date, and it looks like that now is converted to a timestamp, which is perfectly fine. We could also do it as just a date column, but honestly, it really doesn't matter. This is a great change, and it cleans it up immensely. So, now it's all standardized, it's actually in a date column or a timestamp column, and that works great. Now, all we need to do as the last part of this process is we have to write this data to a new table, and that's going to be our silver table. So, I'm going to come down here. 
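The cleanup described above can be sketched in pandas roughly as follows. This is a minimal sketch, not the code the assistant generated in the video: the column names (`user_id`, `signup_date`), the sample rows, the `df_clean` variable name, and the month/day/two-digit-year date format are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the two issues described in the transcript:
# one date written with periods, and a repeated user ID.
df = pd.DataFrame(
    {
        "user_id": ["usr_1008", "usr_1009", "usr_1009", "usr_1010"],
        "signup_date": ["2/27/24", "2.29.24", "3/1/24", "3/2/24"],
    }
)

# Replace periods with forward slashes so every date uses one separator.
normalized = df["signup_date"].str.replace(".", "/", regex=False)

# Parse the strings into real timestamps (the format is an assumption).
df["signup_date"] = pd.to_datetime(normalized, format="%m/%d/%y")

# Keep the first row for each user ID and drop later repeats.
df_clean = df.drop_duplicates(subset="user_id", keep="first").reset_index(drop=True)
```

Keeping the first occurrence is a judgment call. If the later row is the more recent or more complete record, `keep="last"` or an explicit sort before deduplicating may be the better choice.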
I'm going to say, and I could put it as the Genie code, or I can come over here. I tend to like using this side a lot more, I don't know why. But, I'm going to say, "Write this cleaned table to a new table in the same schema, and call it s3_cleaned_silver. " And so, let's go ahead and let that run, and it should take just a second, and we'll have that code for us. Let's go down here really quick. We have this. This looks great. I'm going to allow this to run for us. So, it's going to run this code. Now, it is giving us this warning, and this is a very fair warning. We're using overwrite right here, and basically what we're doing is every time we run this, we're overriding the previous data that's in that table. For now, I'm just going to use that cuz it's not a huge deal. You know, as you start getting more sophisticated with your data pipelines, you are going to want to think about things like adding data to your existing data instead of overriding, but you know, that can get a little bit more advanced depending on your data and your data need. Now, it's going to run this, and I'm going to accept all, and then let's come right over here to our data engineering, video_1, and now we have this s3_cleaned_silver. So, our bronze to silver transformation is complete. This is all we needed to do in order to transform our data. And now we have our raw data, and let's come actually back to our catalog, and we'll just take a look at this. We can get rid of our Genie code real quick. So, we're going to come over here. So, our raw data is still going to be raw. Let's go ahead and run this. This is our bronze level, right? We still have the raw data, duplicates. But, when we come over to our silver, this is now going to be our cleaned level. So, now that we have all of our transformations completed, we've taken it from bronze to silver, now we want to create our silver to gold transformations as well. 
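The difference between overwriting and appending can be shown with a small stand-in. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` in place of a Databricks table, and the `write_table` helper, table names, and rows are hypothetical. It only demonstrates what happens when the same write runs twice in each mode.

```python
import sqlite3


def write_table(conn, name, rows, mode):
    """Write rows to a table, either replacing its contents or adding to them."""
    if mode == "overwrite":
        # Overwrite: throw away whatever the table held before this run.
        conn.execute(f"DROP TABLE IF EXISTS {name}")
    elif mode != "append":
        raise ValueError(f"unsupported mode: {mode}")
    conn.execute(f"CREATE TABLE IF NOT EXISTS {name} (user_id TEXT, country TEXT)")
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
    conn.commit()


def row_count(conn, name):
    """Return the number of rows currently stored in the table."""
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [("usr_1001", "US"), ("usr_1002", "CA")]

# Running an overwrite twice leaves the table with one copy of the data.
write_table(conn, "silver_overwrite", rows, "overwrite")
write_table(conn, "silver_overwrite", rows, "overwrite")
overwrite_count = row_count(conn, "silver_overwrite")

# Running an append twice stacks the same rows on top of each other.
write_table(conn, "silver_append", rows, "append")
write_table(conn, "silver_append", rows, "append")
append_count = row_count(conn, "silver_append")
```

Overwrite is safe to rerun but reprocesses everything each time. Append keeps history but needs its own guard against loading the same rows twice.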
Let's go back to our workspace, and we'll come down here to the silver to gold, which is going to still be right up here for us. Now, let's give it a use case, right? We could use this table just as it is, and we could hit off of it, and we could build dashboards and all sorts of things. Sometimes you want to track certain KPIs or certain things that you can't just get from the raw data. So, I'm just going to give it a simple use case, let it write it out, and we'll create our silver to gold transformation. So, let's come right down here. I'm going to say that I want to know the best day of the week that people are clicking on certain ads, and we're going to see what it creates for us. So, I want to create a new table called insights_gold, and I want it to show me the best days of the week and what ads people clicked on the most. And let's run this and just see what it does. All right, so it went and did a lot of work for us. It did not take long. This is maybe 15 seconds. It's doing some group bys on some different columns, and then it's getting some counts for us on different sign-ups and referral sources. Let's go ahead and allow it to run this, and let's see what it does. Now, it's giving us a few things as far as outputs. One, this first one is extracting the day of the week and analyzing sign-up patterns. So, Thursday, Tuesday, Monday, and it's giving us kind of the day of the week when we had the most sign-ups. And then if we come down here, we also have another one where we're getting the referral source, basically social media, organic referral, Google Ads or partner, the total clicks, and the countries reached. And if we go down here, we have this last table, but it hasn't been run yet cuz this is actually creating our table. And so, this one should be really interesting, but let's actually stop it really quick, and then I'm going to accept and then run this as well. I just want to see what this one is. 
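The kind of group-by behind a gold table like this can be sketched in plain Python so the logic is easy to see. The rows, column names, and output shape below are assumptions for illustration. The code the assistant produced in the video is not shown in the transcript and may differ.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical silver rows: (signup_date, referral_source, country).
silver_rows = [
    (date(2024, 2, 27), "Google Ads", "US"),
    (date(2024, 2, 29), "Social Media", "CA"),
    (date(2024, 2, 29), "Google Ads", "US"),
    (date(2024, 3, 5), "Google Ads", "DE"),
]

# Group by (day name, referral source), counting signups and collecting countries.
groups = defaultdict(lambda: {"signups": 0, "countries": set()})
for signup_date, source, country in silver_rows:
    key = (DAY_NAMES[signup_date.weekday()], source)
    groups[key]["signups"] += 1
    groups[key]["countries"].add(country)

# Flatten into rows and sort so the busiest combinations come first.
insights_gold = sorted(
    (
        {
            "day_name": day,
            "referral_source": source,
            "signups": g["signups"],
            "unique_countries": len(g["countries"]),
        }
        for (day, source), g in groups.items()
    ),
    key=lambda row: row["signups"],
    reverse=True,
)
```

One caveat worth keeping in mind: the sample data records sign-ups by referral source, so "clicks on ads" here really means sign-ups attributed to a source.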
And so, and then we have day name, the referral source, sign-ups, and unique countries. I think this is the one that I, you know, was kind of hoping for when I asked it to run it for us, but it gave us different options, which I like. Now, all we have to do is we have to get rid of this, and we're going to let that run. And so, let's accept that, and let's run this as well. Now, this display is literally just displaying right up here, so we aren't actually reading this in, but let's come back into our catalog, and let's go see if we have that gold table now. So, now we have our video, we have our insights gold, and let's just look at our sample data, and there we go. And so, this would be like our gold table that we can now use. We now have some insights into our data. Now, all we've done so far, if we come back in here, all we've done so far is we've just written code. We haven't necessarily created any type of pipeline. And so, now this is the part of the video where we're going to get into building an actual pipeline, and I did it this way very specifically. This is how I tend to write my code. I come into a notebook, I write out my code, and then I'm like, "Okay, this is looking good. Let me now go create my pipeline. " So, let's come right over here. I'm going to come down to our runs, and there's this thing right here that says ETL pipeline. Now, let's get rid of this. We also have this right here, which is kind of what we're going to cover a lot in the next lesson, but I want to talk you through really quickly while we're here the difference. Now, we created two separate notebooks, one from bronze to silver and one from silver to gold. Now, sometimes with simpler pipelines like the one we just created, it could be totally fine to just come in here, create a job, and say, "Do this one and then do this one. " Right? That's all we're doing. 
We can put it on a schedule, or we can create, you know, a different trigger for that, and we'll look at that in the next lesson. But, if you have a more complex pipeline, you're typically going to want to use this right here, which is our ETL pipeline. Let's go ahead and click in on this ETL pipeline, and let's come down here to start with an empty file. Now, you can start with sample code in SQL, sample code in Python, or if you have ones that you've already done, you can do that. We don't have anything, and I don't really want to kind of explain all of the sample code that they're going to be creating. Let's just start with an empty file. Now, we need to specify the language that we're using, and this is very important because once you create it, that's kind of the one that you're going to stick with. We're going to use Python, and this is just asking for a folder path. And so, we'll keep that, and we'll say, "Yes. " And now what we have looks very similar, right? We have these kind of some notebooks on the left, and then we can write our code right here. It looks very similar. But, there is a big difference between running something in a notebook like we were in our workspace before and running something in an ETL pipeline. When you're just running your code, it's running the code as is. It's pretty simple. And if you did what we said earlier, which is you literally just take that notebook, you put it into a job, and you say, "Run this and then run this," it's literally just going to take your code and run it. The issue with that though is it's not going to have any built-in data quality checks. We're going to have to manage basically all of the logic ourselves, and it's not going to handle any lineage tracking or dependencies within your code. Now, this is where ETL pipelines come into play. 
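When notebooks run as plain job tasks, checks like these have to be written by hand. The following is a minimal sketch of what that might look like. The `split_by_quality` helper, the rule names, and the sample rows are hypothetical, and it does not reproduce the built-in quality checks of a Databricks ETL pipeline.

```python
def split_by_quality(rows, rules):
    """Return (valid_rows, rejected), pairing each bad row with the rules it failed."""
    valid, rejected = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            rejected.append((row, failed))
        else:
            valid.append(row)
    return valid, rejected


# Each rule is a name plus a function that returns True for an acceptable row.
rules = {
    "user_id_present": lambda r: bool(r.get("user_id")),
    "email_has_at_sign": lambda r: "@" in (r.get("email") or ""),
}

rows = [
    {"user_id": "usr_1001", "email": "a@example.com"},
    {"user_id": "", "email": "b@example.com"},
    {"user_id": "usr_1003", "email": "not-an-email"},
]

valid, rejected = split_by_quality(rows, rules)
```

Returning the rejected rows together with the failed rule names makes it possible to log or quarantine them instead of silently dropping data.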
An ETL pipeline is going to have things built into it like automatic incremental processing, built-in data quality checks, failure recovery, things like that are extremely useful when you have really complex pipelines, which we aren't doing in this lesson, of course. This is very simple, but you have to think, you know, if you're creating a real ETL pipeline with a lot of dependencies, a lot of complexities to it, you absolutely are going to want to come in here. Now, when we write this out, we can't just write it as our regular code. And we can actually do that. Let's come back, and let's go to all of our files. Let's go to bronze to silver transformation. We're going to move this just so we can visually see it. We're going to put this in our transformations, and then we're also going to take our silver to gold, and we're going to move this to our transformations as well. So, we're going to put this all in one place. And so, now we have the silver to gold, and we have the bronze. We don't actually need this file anymore. So, we could just get rid of this. Now, your UI might look slightly different. That's just because Databricks is always updating things, but you should still be able to follow along. But, let's go ahead, and this is our code. It's exactly how we wrote it before. Let's try to run this pipeline. It's going to try to run this, and it should try to run that, too. Let's just go ahead and run it and see what happens. All right, so we got this error down here that says pipelines are expected to have at least one table defined, but no tables were found in your pipeline, which might seem very counterintuitive because, you know, we've created different data frames. We've been working with tables. So, it should understand what it's doing. Now, it is actually rewriting the code as we go. I think it's identified the issue already, and let me explain this even though it's starting to write it out already for us, which is awesome. Thank you, Genie code. 
But, here's what's happening. When you're running code just in a notebook, it's just going line by line and running the code. But, within this ETL process, and just ignore that for a second cuz I'm just going to let it run. Within this ETL process, what it's using is something called SDP, which is Spark Declarative Pipelines. This is just a different construct and a different framework within the ETL pipeline. And so, what it actually needs is something called a materialized view. It needs to kind of look at what the output is going to be or supposed to be. It's not just blindly running your code for you. It's doing a lot of heavy lifting with data quality checks and all these different things. Now, it just went through and it fixed it for us. It is basically the same code, and let's come up, but it's creating these materialized views. So, we have dp.materialized_view, and it's kind of naming it and giving a little comment on what it is. It's doing the work for us, and then it's creating another materialized view where we use this insights gold, and it's actually putting it all into one, which is fine if that's what we want to do with this pipeline. But, let's go ahead and accept this, and let's try running this pipeline again. So, now we have a little bit more information. We can come right down here, and we can see it was trying to create these different materialized views, and it was working. And so, now this whole thing has run successfully. Let's actually rename this really quick. We're going to do bronze, I need to spell bronze right, bronze to silver to gold ETL pipeline, and let's save it like that. And if we come back over here, we can go to our jobs and pipelines. We now have this pipeline right here. Of course, it failed at first, but now it's running and it's working successfully. So, now we have this pipeline that we have stored, and we can actually start using this in, you know, automations where we can orchestrate these pipelines right here. 
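The idea of declaring datasets instead of running lines top to bottom can be illustrated with a toy model. To be clear, this is not the Databricks API: the `materialized_view` decorator, `read`, and `run_pipeline` below are hypothetical stand-ins written in plain Python. They only show why a pipeline with no declared datasets has nothing to run, and how one declared dataset can build on another.

```python
_registry = {}  # dataset name -> function that builds it
_cache = {}     # dataset name -> materialized result


def materialized_view(func):
    """Toy decorator: register a function as a named dataset definition."""
    _registry[func.__name__] = func
    return func


def read(name):
    """Build a dataset on first use, then reuse the cached result."""
    if name not in _cache:
        _cache[name] = _registry[name]()
    return _cache[name]


def run_pipeline():
    """Materialize every declared dataset; fail if nothing was declared."""
    if not _registry:
        raise RuntimeError("no tables defined in pipeline")
    return {name: read(name) for name in _registry}


@materialized_view
def s3_cleaned_silver():
    raw = ["usr_1001", "usr_1009", "usr_1009"]  # stand-in for the bronze table
    return list(dict.fromkeys(raw))             # drop repeats, keep first


@materialized_view
def insights_gold():
    # Reading the silver dataset by name is what links the two steps.
    return {"user_count": len(read("s3_cleaned_silver"))}


results = run_pipeline()
```

Because each step is declared by name, the runner knows which outputs exist and which ones feed others, which is the information a plain notebook run never exposes.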
It says orchestrate notebooks, jobs, queries, and more. And there is a lot to that, and that's what we're covering in the next lesson. But, if we open this up, we can actually see what's happening under the hood. We can see these are connected. We're doing this one and then this one, and we can see how it's running. And so, there's a lot of things that this ETL pipeline is going to handle for us that we don't even have to worry about. That really is one of the biggest advantages of using an ETL pipeline instead of just running your notebooks. Although, again, there are some advantages to just running your notebooks as is if it's a little bit of a simpler pipeline. I really hope you're able to follow along with this lesson because this is really cool stuff. You can also just come into here, and we can create an ETL pipeline, and you can create a pipeline with AI. So, we can literally just come here, and we can type in exactly what we want our code to look like and do within our data, and it can build that out. Instead of starting with a notebook and then creating our ETL pipeline, you can just come right in here and start doing that process here. I will say though, my personal workflow, because I'm usually not doing super complex pipelines that are involving, you know, ton of different dependency chains and all these different things, is I tend to like writing my code in notebooks. That's just what I'm used to. But, there are going to be lots of use cases where you're going to need to come in here, and you can just start here instead of starting with a notebook.
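To make the "no tables were found" error from earlier in this lesson a bit more concrete, here is a toy model in plain Python. It is not the Databricks API: `ToyPipeline`, its `materialized_view` decorator, and `NoTablesDefinedError` are hypothetical names made up for this sketch. The idea it tries to show is that a declarative pipeline only knows about datasets you register with it, so plain notebook-style code that never declares a table gives the pipeline nothing to run.

```python
from typing import Callable, Dict, List


class NoTablesDefinedError(Exception):
    """Raised when a pipeline is run without any declared datasets."""


class ToyPipeline:
    """Toy stand-in for a declarative pipeline (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._definitions: Dict[str, Callable[[Dict[str, list]], list]] = {}
        self._order: List[str] = []

    def materialized_view(self, name: str):
        """Register the decorated function as the definition of dataset `name`."""
        def register(fn):
            self._definitions[name] = fn
            self._order.append(name)
            return fn
        return register

    def run(self, sources: Dict[str, list]) -> Dict[str, list]:
        # The pipeline only runs what was declared, so nothing declared is an error.
        if not self._definitions:
            raise NoTablesDefinedError("no tables were found in the pipeline")
        results = dict(sources)
        # Simplification: run in registration order. A real engine works out
        # the order from the dependencies between datasets.
        for name in self._order:
            results[name] = self._definitions[name](results)
        return results


pipeline = ToyPipeline()


@pipeline.materialized_view("silver")
def silver(tables):
    # Clean the raw "bronze" rows.
    return [value.strip().title() for value in tables["bronze"]]


@pipeline.materialized_view("gold")
def gold(tables):
    # A single business metric: how many clean rows we have.
    return [len(tables["silver"])]


out = pipeline.run({"bronze": ["  apple ", "BANANA"]})
print(out["silver"], out["gold"])
```

Running an empty `ToyPipeline()` raises `NoTablesDefinedError`, which is roughly the situation the notebook code was in before it was rewritten as materialized views.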

### Orchestration and Automation [29:06]

Now, in the last two lessons, we've been building out our ETL pipeline. We've been writing all of our code and getting everything set up. But, once we actually have everything set up, then we need to automate this process so that we don't have to manually go in and run the code ourselves. Luckily, Databricks has this already built out for us. It is called a job. And so, we're going to jump into Databricks. We're going to create our own custom job, and we're going to see all the small things that you need to do in order to create this automation. Now, in our last lesson, we built out this bronze to silver to gold ETL pipeline, and we're basically creating two separate tables, this S3_clean_silver, and then this insights gold. And that is our silver and our gold tables after they're transformed, and we find our business insights. Now, just for demonstration purposes, I also just kept our regular code in here as well. We have this bronze to silver, then we have another notebook for silver to gold. Now, these are just regular notebooks in Databricks, but I do want to show you how you can use this within a job as well. But we have this bronze to silver transformation, and you can see it in a pipeline. And then if we just go to our bronze to silver, this is just a regular notebook. Now, in order to create our job, let's come right down here. We're going to go to runs. We're going to come over to this is orchestrate notebooks, pipelines, queries, and more. So, let's come in here. Now, this is a new UI for us, and what you can do here is you can orchestrate the different steps that you want within your job. If we click right down here, we can see all the things that we can do. We can create ingestion pipelines, or we can use existing ones. We can come down here, and we can run notebooks, Python files, SQL queries, SQL files. And we have some more advanced things right down here like if else conditions, or you can create triggers from another job. 
And then we also have this ingestion and transformation, and these are really useful because if you have an ingestion pipeline, an ETL pipeline, or a database table sync, then you can just use those that you've already created. Now, we've created an ETL pipeline. Let's go ahead and click on this ETL pipeline. We're going to come down here, and we're going to click on this bronze to silver to gold ETL pipeline. Now, I'm just going to call this bronze to silver to gold, keep it simple. And all we would need to do is create this task. Now, of course, that would be a little too simple, right? But this is as simple as it can get for any type of pipeline orchestration that you're trying to do. Often times, when I'm creating entire pipelines, and there's a lot of different steps to it, I package everything into an ETL pipeline, and then I just place it in here. And then, what I'll do is I'll come over here to schedules and triggers. Now, we'll look at that in just a second. Really quick, we can also trigger a full refresh on this pipeline, so we can click on this. We can also add notifications if you want to send a notification when it kicks off or when it finishes. We can also look at retries. Now, this is really important because sometimes you are going to have things that fail for any number of reasons. Maybe you're trying to run this, but the data hasn't all imported yet, or you're trying to run this transformation, but there's some connection issue, and that causes it to fail. You'd want to retry maybe an hour later or on a different day. And so, you can come in here, and you can say, "Okay, I want to try this a ton of times. Let's try it 30 total times, and every single time, we're going to wait maybe 30 or 40 minutes between each try, and then it'll keep trying until it is successful."
Again, with this, you can notify yourself and make sure that you know what's happening, especially if this is a really important pipeline within your company. It is important to have these things set up, so you don't have to manually go in there and find out that it failed last night, you never got a notification, and it never tried again. So, this would absolutely be something that you'd want to do. And then you have metric thresholds. You can set these, especially for something like a run duration. If you know this should take 5 minutes at most, you can set a timeout threshold or a warning threshold at maybe 30 minutes, so that it isn't just going to keep running cuz sometimes it gets stuck in these loops, and it keeps trying, and it's going to run forever, cost a lot of money, and you don't want that to happen. So, these are all really important things to think about when you are actually creating these jobs. Now, let's come back here to schedules and triggers. For something like this, when you've done almost all the work in an ETL pipeline, you are going to want to schedule or trigger this most of the time. Now, for something like this pipeline, what we've done is we've extracted data out of an S3 bucket. What we would want to do is probably set a trigger for this. Now, what we need to do is we need to create this task first, so that it's saved in there. And then, let's say this is our entire job. It's a very simple one. But now we can come in here, and we can add a trigger. There are several different types of triggers. One, we have a schedule, which is as simple as it sounds. We are just going to schedule this. Right now, it'll be active, you can pause it. We're just going to schedule this, and we'll say every 1 week. And so, every 1 week, we're going to save this, and this is going to run every week. So, that's super simple. Now, let's delete this, and let's add another trigger. 
We can also schedule it, and we can go a little bit more advanced, and we can schedule it at a very specific day and time. Now, this is what I usually do because there are certain cadences and timing to things that I really like. For example, at a previous job that I used to work at, we wanted the data to be as fresh as possible because we actually had it refresh often, like every 10 minutes. And so, what we were doing was we were trying to run it as soon as we could in the morning to where it would still run, but it would give us the freshest set of data by about 8:30 in the morning. So, we would kick off this job at like 7:45, so that the freshest data would be available by 8:30. This is more advanced, you don't have to do this, but this is a really useful thing to do. The next thing that you can do, or the next type of trigger, is a file arrival. So, if we click on file arrival, we're going to say when a file arrives at this location, kick off this job and run everything within it. Now, for our process, this would be like our S3 bucket. If and we can go and look at our S3 bucket, if a new file gets dropped in here or this gets updated, then we may trigger this job, and it will run. And of course, we have advanced settings as well, where we can wait a minimum time between triggers because what if you're uploading a lot of documents at the same time? You don't want it to trigger 20 times because you just dropped 20 different files in there one at a time. You'd want to wait for all these files to get in there. So, that is absolutely an option. And if we go back, we also have a table update. So, this would trigger when new data is updated on a table. Now, for our use case, this may work because we have S3 data, we're bringing it into our bronze table. 
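The advanced file arrival settings described above ("minimum time between triggers" and waiting for all the files to land) can be modeled with a small simulation. This is only a sketch under my own assumptions: `trigger_times` is a hypothetical helper, times are plain seconds, and the minimum gap is measured from the previous trigger rather than from the end of the previous run, which may differ from what Databricks actually does.

```python
from typing import List


def trigger_times(
    arrivals: List[float],
    wait_after_last_change: float,
    min_time_between_triggers: float,
) -> List[float]:
    """Return the times at which a debounced trigger would fire.

    Each arrival restarts a quiet-period timer of `wait_after_last_change`
    seconds. A trigger fires when the timer expires, but never sooner than
    `min_time_between_triggers` seconds after the previous trigger.
    """
    fires: List[float] = []
    pending_deadline = None
    for arrival in sorted(arrivals):
        # The quiet period ended before this arrival: fire for the old batch.
        if pending_deadline is not None and arrival > pending_deadline:
            fires.append(pending_deadline)
            pending_deadline = None
        # This arrival starts (or extends) the current batch.
        deadline = arrival + wait_after_last_change
        if fires:
            deadline = max(deadline, fires[-1] + min_time_between_triggers)
        pending_deadline = deadline
    if pending_deadline is not None:
        fires.append(pending_deadline)
    return fires


# 20 files dropped one at a time, 5 seconds apart.
uploads = [i * 5 for i in range(20)]

no_debounce = trigger_times(uploads, 0, 0)
debounced = trigger_times(uploads, 60, 0)
print(len(no_debounce), debounced)
```

With no waiting, each of the 20 files fires its own run. With a 60 second quiet period, the whole upload produces a single run shortly after the last file arrives.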
So, I can come in here, and I can say when this table, and I would just specify that table name that we've been using, when this bronze table gets updated from that S3 bucket, then kick off this job, which of course, this ETL pipeline takes that bronze data, we transform all the data, we create our gold tables, and then we have all that data sitting there. So, this might be a really good use case. We have some advanced options down here, minimum time between triggers and wait after last change, just like we did before. Cuz sometimes data gets updated continuously, and so it might trigger it many times. These are things that you should test and try out within your pipelines just to make sure you get them right. Now, let's cancel out of this, and let's actually get rid of this entirely. Let's actually come here, and we're going to go back to our runs, or sorry, back to our jobs. And I want to show you one more thing within here that might be really useful. Now, we just kind came down here, and we pulled in uh this ETL pipeline, but let's actually pull in and run a notebook. So, we're going to specify our notebook. We're just going to do this as our bronze to silver. And this is a notebook. It's within our workspace, not a Git provider. And let's select our notebook. So, we're going to come in here. We're going to do bronze to silver. Let's confirm this. And you'll notice we have a lot of different options in here, some similar, right? We have retries, we have notifications, and we have metric thresholds, but we also have parameters. These are parameters that you can pass down to the task. Because this is just a notebook, it doesn't have all that built-in stuff that we were talking about in the last lesson within the ETL pipeline. So, you do need to configure this a little bit more within a job. So, we can add these parameters where create these kind of key-value pairs that we pass into a notebook, but let's come in here. Let's create this task. 
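The retries option that shows up on both the pipeline task and the notebook task behaves, roughly, like the loop below. This is a plain-Python sketch of the idea, not how Databricks implements it: `run_with_retries` and `flaky_transformation` are hypothetical names, and the `sleep` argument is injectable so the example does not actually wait 30 minutes.

```python
import time
from typing import Callable, List


def run_with_retries(
    task: Callable[[], None],
    max_attempts: int,
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `task` until it succeeds or `max_attempts` total attempts are used.

    Returns the number of attempts it took. If every attempt fails, the
    last exception is re-raised so the failure is not silently lost.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait before the next try, e.g. to let late data finish loading.
            sleep(wait_seconds)
    raise AssertionError("unreachable")


# A task that fails twice (data not there yet) and then succeeds.
calls = {"count": 0}


def flaky_transformation() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("bronze data not available yet")


waits: List[float] = []
attempts_used = run_with_retries(
    flaky_transformation,
    max_attempts=30,       # "30 total times" from the lesson
    wait_seconds=30 * 60,  # 30 minutes between tries
    sleep=waits.append,    # record the waits instead of sleeping
)
print(attempts_used, waits)
```

Here the task succeeds on the third attempt after two 30 minute waits. A run duration timeout is the complementary guard: retries handle runs that fail, while a timeout handles runs that never finish.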
And now we're going to add in another task. So, let's come here, choose notebook, and this is going to be our silver to gold. Now, these two tasks, and let's actually name this. These two tasks that we've created, these two notebooks, do the exact same thing as our pipeline. But I wanted to show you this because it does give us some more information when we're actually building out these jobs. So, we specified our path, we have our compute as serverless, but now we have something called a dependency or a dependency chain. This right here, this line, is a dependency. With what we have right now, this silver to gold is completely dependent on this bronze to silver, which means if we get this data in, and this bronze to silver does not run correctly, then this silver to gold is never going to run. And in this use case, that's perfectly fine because this relies heavily on this bronze to silver. But there are going to be use cases where that is not the case, where we would not want that to be, you know, a dependency. We wouldn't have to rely on it. Or, we also have an option right down here to run if dependencies, and we have a lot of different options. So, right now, all succeeded means this has to run properly in order for this to run. But there are going to be cases when you create these chains or these dependency chains where you're like, "It doesn't matter if this one succeeds. We just want it to run after this one runs, whether it fails or not." And so, for that one, you can come in here and say, "At least one succeeded, none failed, all are done, at least one failed, or all failed." You can specify whichever option you need. For us, we would want to keep this all succeeded because if this one fails, we don't actually create the silver tables that are needed in order to run this one. So, that is pretty important. We can come down here, and we can create this task. 
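The "run if dependencies" options can be sketched as a small rule table applied to the statuses of the upstream tasks. The job below is written as a Python dict loosely shaped like a Jobs API payload (`task_key`, `depends_on`, `run_if`); treat those field names and the upper-case option names as my assumption about that API and check the current reference before relying on them. The `simulate` function, the task names, and the `"skipped"` status are hypothetical simplifications.

```python
import copy
from typing import Callable, Dict, List

# Each rule receives the list of upstream statuses ("success", "failed", "skipped").
RUN_IF_RULES: Dict[str, Callable[[List[str]], bool]] = {
    "ALL_SUCCESS": lambda s: all(x == "success" for x in s),
    "AT_LEAST_ONE_SUCCESS": lambda s: any(x == "success" for x in s),
    "NONE_FAILED": lambda s: all(x != "failed" for x in s)
    and any(x == "success" for x in s),
    "ALL_DONE": lambda s: True,
    "AT_LEAST_ONE_FAILED": lambda s: any(x == "failed" for x in s),
    "ALL_FAILED": lambda s: all(x == "failed" for x in s),
}

job = {
    "name": "silver to gold job",
    "tasks": [
        {"task_key": "bronze_to_silver"},
        {
            "task_key": "silver_to_gold",
            "depends_on": [{"task_key": "bronze_to_silver"}],
            "run_if": "ALL_SUCCESS",
        },
    ],
}


def simulate(job_spec: dict, outcomes: Dict[str, str]) -> Dict[str, str]:
    """Walk the tasks in listed order and decide which ones actually run.

    `outcomes` says how each task would end if it ran. A task whose run-if
    rule is not satisfied is marked "skipped". Assumes tasks are listed
    after the tasks they depend on.
    """
    status: Dict[str, str] = {}
    for task in job_spec["tasks"]:
        upstream = [status[d["task_key"]] for d in task.get("depends_on", [])]
        rule = RUN_IF_RULES[task.get("run_if", "ALL_SUCCESS")]
        if upstream and not rule(upstream):
            status[task["task_key"]] = "skipped"
        else:
            status[task["task_key"]] = outcomes[task["task_key"]]
    return status


bronze_fails = {"bronze_to_silver": "failed", "silver_to_gold": "success"}
strict = simulate(job, bronze_fails)

# Same job, but the second task runs regardless of how the first one ended.
relaxed_job = copy.deepcopy(job)
relaxed_job["tasks"][1]["run_if"] = "ALL_DONE"
relaxed = simulate(relaxed_job, bronze_fails)
print(strict, relaxed)
```

With all succeeded, a failed bronze to silver step keeps silver to gold from running at all. With all done, silver to gold runs anyway, which is only sensible when the downstream task does not need the upstream output.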
And now we have this job that we've created, and we can run it now, or of course, we could add in our trigger. Now, typically with something like this, it could go either way. You could have it on file arrival, table update, or a schedule. It really is just very dependent on your workflow and how you want this to trigger. For most of these, you're going to have some type of trigger. Let's just set it on a schedule and let's go to advanced and we're going to set this for every week. And let's do this on a Monday and let's do it at 7:45 cuz that's when I used to run some of ours at a previous job. So, I'm going to do it at 7:45 every Monday morning. Let's go ahead and schedule this. And now we've updated this job and now we can also rename this. I'm going to call this our silver to gold job. So, now if we go back to our jobs and pipelines, we have our silver to gold job right here. This was the pipeline that we built out in the last lesson and this is going to be orchestrated and scheduled to run this pipeline. Well, actually we used the notebooks instead of the pipeline for that last example, but we're going to be running that code to actually create and update those tables. So, that is how we create a job in Databricks. This is extremely useful. Again, like we did just a little bit ago for our silver to gold job, and let's go into the tasks. If it's a really small transformation and maybe it's just for me, I'll just do it like this where I just have the notebooks. But if it's a larger transformation, especially if there's a lot of dependencies and complexity, I will use an ETL pipeline. So, get in here, mess around with this, try this out because this is super fun to play around with and kind of get all those dependency chains going and getting the ETL pipelines where they're triggering off of each other or when a file is updated. This is really cool stuff to mess around with and is awesome to use within

### Building a Full ETL Pipeline [41:16]

Databricks. Now, if you haven't been following along in the past three videos in this series, we've covered several things. One, we've just learned about ingesting data. Then after that, we looked at ETL pipelines and then we looked at creating a job to orchestrate all these things and to kind of automate the process. In this video, we're going to be putting all of that together into one. We're going to add some things that we didn't cover in previous lessons to make it a little bit more advanced, but it's going to cover a lot of the same concepts. Let's not waste any time. Let's jump right onto my screen and get started. Now, before we actually jump into Databricks, what we're going to be working with is that same S3 bucket that we created earlier, but I created this transactions folder and that is going to be an important piece of this process. It's something that we touched on in a previous video, but we're actually going to be doing it in this lesson. So, we use this users_dirty. csv in this bucket, but inside of this transactions, we have three separate transaction files and we'll actually be adding another one later on to show how the entire process works. So, I'm going to have these and the other file down in the description. You can just download those from GitHub, but we will need those. So, we're just going to start off with these three, the 1_6, 1_13, and 1_20. Now, really quick, just to show you what data we're working with, this is our data. Let me actually zoom in just a little bit. Uh the data itself is not as important for this specific project just because we're more focused on the process of building the pipeline within Databricks, but within the project, we will be cleaning this data a little bit because this is just a horrible column. Uh I think whoever, you know, was collecting this data just left this free text or something for the people to just put whatever they wanted in there. 
Uh not a good system, but that is the kind of data that we're going to be working with. So, let's come up here. Let's get out of this. We don't need to save it. Now, let's come up to our Databricks. Now, in our previous lesson, this is what we built. We built this uh pipeline right here, bronze to silver to gold ETL pipeline. And then in the very last lesson, we created this silver to gold job, which basically scheduled this and automated this and it ran successfully and everything was great. Now, what we're going to be doing is it in a similar fashion, but we're covering some new things. All you need to do, and I actually have another tab for this cuz I don't want to have to keep going back and forth when we're building this out, but I created this end to end schema within our data engineering catalog. You don't have to do this. You can put this wherever you want. I just did this as kind of where we'll be building things out. So, I'll just come back to this as we start adding in new tables, as we start creating this stuff, I'm going to come back to that. Now, this is where we're going to be doing a lot of our work on this uh tab right here. So, let's come over to data ingestion. Let's go over to our Amazon S3. Now, if you haven't already in a previous lesson, I think the second video, we connected to an S3 bucket. So, if you don't know how to do that, then come over here and do this. Now, we used it for the one time cuz all we used was this users_dirty. csv, but in order to schedule this data ingestion, we're going to use a folder. So, we have this transactions folder right here. So, we're going to click on this. transactions and we have those three separate files in there and we can schedule when we want to bring those in. Now, we can be very specific or pretty laid back. Uh so, for example, if we want to do, you know, once a day, we can specify what time of day we want that and that's similar to a job, so it's not that crazy. 
Now, what we're going to be actually doing is we're going to schedule this for basically every 30 minutes. And what we're going to do is we're going to build this entire thing out and what our trigger is going to be inside of our job is when a table gets updated. So, then we're going to drop a file in our S3 bucket and when this brings it in at that 30-minute point, it's then going to refresh, kick off the job, which runs our ETL pipeline. We should be able to do all this within 30 minutes for sure. So, I'm going to say every 30 minutes and we'll just set it at 0 minutes past the hour, which means at basically the top of the hour. Now, this is my time zone, but you can set it to whatever time zone you want. Now, let's go ahead and preview this table. It's going to start up our compute and then it's going to give us basically what we need in order to create this table, which is our preview and then where we want to place it along with the table name. Now, an important thing to note from just those three files is there are only 50 rows of data in each one. So, if we come down here, we got all the way up to 100, so we at least know two of those files are coming in just from this preview. We're going to keep this as the transactions, but for the schema, we're going to add the end to end, which is the custom one that we created for this project. So, we have transactions right here. Let's go ahead and create this streaming table. So, now this table has been created. Let's just look at a sample of this data. It should show us enough to be confident all three got in, but then we can just also run a query and that's probably fine. In fact, instead of waiting, never mind, we got them all in. I was going to say we don't have to wait on this. We could just run a query in like a notebook or a SQL editor, but we have all 150, so that's all three files. So, now that we know we have all three of our files in, cuz it's 50 each, it's going to be 150 rows. 
Now that we know those are in, we can start building things out. Now that we know that it's all in there, what we can do is let's come over here to our jobs and pipelines. Now, this is where we were before. We only had these two things. We had a pipeline and now we had a job and now we have another pipeline. We didn't build this ourselves. This was built automatically and if we come in here, we can get a little bit of information on this. This is our streaming pipeline that we created to put into this end to end transaction. So, this is that streaming table that we created. And so, we don't have to technically manage this. It's going to be managed by Databricks itself. And so, this is just something to note that when we did that, we did create its own pipeline for this. Now, what we need to do is we need to create an ETL pipeline. So, let's come in here. We're going to click on the ETL pipeline. This new UI pops up right away. We don't have the options that we had before in previous lessons. Um but now what we're going to do is we're going to start building this out with Genie code. Now, I could absolutely just write all this out and this would be like an hour and a half video or we can have Genie code write it out, which I highly recommend trying it out and starting to use these tools cuz they really speed up your work and if you already know how to program, if you know how to code, this is going to be a huge boost to your productivity. And so, what we're now going to do is use Genie code right down here, basically tell it what we want to build and we're going to do a few things. One, we want to build that bronze to silver, which is basically our raw data, which is that transactions table, to a silver table, which is where the data is cleaned, to then a gold table, which is what we'd use for like a production level uh product or production level analysis or whatever that might be. So, we can come in here and we can use that app and it is prompting us to do that. 
And if we come in, we can say data engineering.end to end. And I'll just put it like that, so it's looking kind of at that schema. I'm just going to say for the transactions table, I want to create a bronze to silver transformation on this raw data. I want you to clean this data set. I'm just going to leave it really open-ended just to see what it does. Maybe it catches something outside of that column. I don't think it will, but let's just see what it does. Then we are going to create a silver to gold transformation. And you can do this in the same notebook or separate notebooks. It may also do that for you with Genie code, but you can be really specific and it's honestly pretty great at what it does. And I want to track daily transactions in that gold table. So, I'm going to give it just this to work on. It's going to take that. It's going to kind of create its logic. It's going to start writing everything out. I have found, and this is not just me saying this, I genuinely love working in this system because Genie code is very good at understanding context and what you're trying to do and working with tables and just everything. And so, we're going to let this run for just a little bit. I'm going to come back. We'll take a look at what it said and then we'll commit some code to start going on this ETL pipeline. All right. So, it just finished. I haven't even really reviewed this cuz it only took, you know, 30 seconds. But, it took a look at the data. Then it came down here and it gave a proposed pipeline architecture. So, here's what we have. We have our bronze layer, which is just going to be our transactions.py. And this reads in the data as is. So, it's going to recreate basically the raw data, which I'm totally fine with. It's not a big deal. Then for our silver layer, we have the silver transactions_clean. 
It's going to trim the white space, standardize capitalization, remove duplicate spaces, filter out null transaction IDs or negative quantities and amounts, and add data quality expectations. I think these are all perfectly reasonable things to do. Then we have our gold layer. It's going to be transformations_gold_daily_transactions_summary.py. So, there's three different files that it's going to create. And it's going to aggregate some of this data into kind of these metrics right here. I think this all looks great. If there was something I wanted to change, I would just tell it, "Hey, let's do this instead." So, let's just say, "Go for it. Start writing the code, my friend." It really is my friend at this point. I've been using it a lot. So, let's let this run. Let's watch the code and then we will commit everything and then it probably will prompt us to do some type of dry run to make sure that there aren't any errors that we're just missing. And then we will run the entire thing and start automating this with a job as well. It's still writing. That was like 10 seconds I stopped talking. But, it's still writing everything. It's going to start organizing this. It's going to start creating our .py files or just our Python files. I just am reading it as is. It's creating our Python files and then it's going to start writing the code, which we are then going to review, approve, and then run. You can see these things starting to pop up. So, we have our code. We have our diff, or, you know, if we had code that it took out, it would also show the minus. But, we're just creating code right now. And so, right here, it's saying, "All right. Do we want to try dry running this pipeline, just to see if it works?" And of course, we're going to do that in a second. But, I'm going to go to each one just to kind of see what it's doing. It looks like this is our gold and we're just using a group by for this. Let's just see what it did for the data cleaning. 
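The gold layer described here is essentially a group by on the transaction date. The generated PySpark code is not shown in the transcript, so the following is a plain-Python sketch of the same idea rather than the code Genie wrote. The column names (`transaction_date`, `amount`) and the two metrics are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def daily_summary(rows: List[Dict]) -> List[Dict]:
    """Group cleaned transactions by day and compute simple daily metrics."""
    totals = defaultdict(lambda: {"transaction_count": 0, "total_amount": 0.0})
    for row in rows:
        day = totals[row["transaction_date"]]
        day["transaction_count"] += 1
        day["total_amount"] += row["amount"]
    # One output row per day, ordered by date.
    return [{"transaction_date": d, **totals[d]} for d in sorted(totals)]


silver_rows = [
    {"transaction_date": "2026-01-06", "amount": 10.0},
    {"transaction_date": "2026-01-06", "amount": 5.5},
    {"transaction_date": "2026-01-13", "amount": 20.0},
]

gold_rows = daily_summary(silver_rows)
for row in gold_rows:
    print(row)
```

In a pipeline this logic would live in the gold table definition and be recomputed whenever the silver table changes.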
So, it looks like it is going to drop some stuff in here. But, we are looking at some regex replace, which is great. Some trimming and proper case uh for a few other stuff. And this looks perfectly uh good to me. I have no problem with what it's doing. Again, this is all subject to be altered. If you want to change this or have it do other things or fix the code yourself, you absolutely can do that. Now, all we're going to do is we're just going to accept this. And so, we're going to allow this and it's going to run a dry so we'll accept, review next, and accept. I didn't have I could have done that in a different way. But, now we're going to try dry running this pipeline. Now, what this does is it is not going to actually run through and run your code. It's doing a dry run. It's basically testing are there any big errors that we need to fix before you actually implement this into, you know, whatever process you're doing so that we don't have issues right off the bat. It's going to run for just a little bit and then it'll tell us if there's any big issues. Um often times, if you've never done this before, you shouldn't have any big issues. But, you could get issues like, "Oh, this table uh you don't have the permissions for this table. " Maybe you wrote something incorrectly or in this case, uh you know, Genie code wrote something incorrectly that is not going to create the materialized view properly or you're pulling from a table that doesn't exist anymore. So, there's lots of issues that could arise. But, let's let this run. It shouldn't take very long. And just like that, we did encounter a small issue. Um it's actually going to run. It'll probably fix this very easily. Um I am not exactly sure what the issue is here just glancing at it. But, it looks like it's fixing that code for all of it. And let's go ahead and just accept that and let's try dry running this one more time. Now, it looks like everything is running properly and this is really good. 
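The cleaning rules listed for the silver layer (trim, collapse duplicate spaces, proper case, drop null IDs and negative values) can be sketched in plain Python. The pipeline itself uses Spark column functions, so this is only an illustration of the rules, not the generated code. The column names `transaction_id`, `product_name`, `quantity`, and `amount` are assumptions based on what is described on screen.

```python
import re
from typing import Dict, List


def clean_product_name(raw: str) -> str:
    """Trim the ends, collapse runs of whitespace, and apply proper case."""
    collapsed = re.sub(r"\s+", " ", raw.strip())
    return collapsed.title()


def clean_transactions(rows: List[Dict]) -> List[Dict]:
    """Apply the silver-layer rules to raw (bronze) rows."""
    cleaned = []
    for row in rows:
        # Data quality filters: drop rows we cannot trust.
        if row.get("transaction_id") is None:
            continue
        if row["quantity"] < 0 or row["amount"] < 0:
            continue
        cleaned.append({**row, "product_name": clean_product_name(row["product_name"])})
    return cleaned


bronze_rows = [
    {"transaction_id": 1, "product_name": "  wireless   MOUSE ", "quantity": 2, "amount": 39.98},
    {"transaction_id": None, "product_name": "keyboard", "quantity": 1, "amount": 25.0},
    {"transaction_id": 3, "product_name": "usb cable", "quantity": -1, "amount": 5.0},
]

silver_rows = clean_transactions(bronze_rows)
print(silver_rows)
```

Only the first row survives, with its free-text product name normalized to "Wireless Mouse".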
So, what we can now do is we can rename this. So, it's going to give us some feedback on that. But, I'm going to rename this and I'm going to say this is our end-to-end uh ETL pipeline. And that's what we're going to name it. So, we have our end-to-end ETL pipeline. And with this, if we come back here, obviously nothing has changed, right? This is just a dry run that we did. Now, what we can do is we can actually run this pipeline and it will run everything. It's going to do all the transformations, all the things that we would want it to do. And we should and we will do that in a little bit. Now, what we want to do is we want to automate this process. All we have to do is we're going to come back here to not data ingestion into runs. And let's get rid of this. Now, we're going to create a job for this. So, we're going to come in here and we're going to say we want our pipeline. And if we come in here, we have our end-to-end ETL pipeline. That's the one we want. We're just going to call this um end-to-end ETL pipeline. Keep it simple. Now, what we're going to do, and you can always come in here and add notifications and retries and metric thresholds, which we covered in the last lesson. Now, we're going to create this task. But, now we're going to add this trigger right here. Now, this trigger is going to be a table update. So, what we want it to do is when new data is actually updated and brought into that table, we want this job to kick off so that it runs our entire ETL process to clean the data and put that new data into our, you know, new tables that we're creating. Now, what we want to do is we want to say this table, when this table gets updated. So, let's come up here and we're just going to copy this name to the clipboard. And we're going to put it right down in here. We could also have typed it out. Um either one's fine. But, I just wanted to copy it. 
So, when this table gets updated by our S3 process, which we're running every 30 minutes, this is going to kick off the ETL pipeline, right? It's this right here. So, now that we have that job updated and created, let's go back to our jobs and pipelines. And now we have a few new things in here. So, right here we have our end-to-end ETL pipeline. I should have named this job. Let's actually come in here really quick. I'm just going to come up here. I'm going to rename this. I'm going to say job to run end-to-end pipeline. And let's rename this. So, we have our pipeline. We have our job to run the ETL pipeline. And we have our transactions. It's going to run 30 minutes on the minute. It looks like um it may have already run before. No, I think we're good. No, it did. It's already run twice. I think that's just because of when I set it to [snorts] the zero time. Perfectly fine. Um but, what we're going to do now is we are going to just check that this end-to-end ETL pipeline is working properly. It's going to create all of our tables. We're going to then write a query just to show that the data looks good. And then we'll go drop our extra file in there. And then we'll wait to have it update and the ETL pipeline bring in the data. Then our job is going to trigger and then it'll run our ETL pipeline to bring in and clean that new data as well. So, let us run this pipeline. This is going to take just a little bit to actually run. And then we'll go check the data in just a little bit. All right. So, this looks like it worked properly. We have completed, and completed. Let us come up here and let's go back and let's refresh this. And it is possible that I put it in the wrong place. And it totally is. I absolutely forgot to change that in the ETL pipeline. It is pointed at the workspace default. Let's actually go back. And you know, this happens. We're going to edit this pipeline. So, it is our default catalog that caused this issue. 
We have our default catalog and default schema as workspace and then default. You can change this. You don't have to, but you absolutely can. Also, if I'm being honest, I should have caught this right away. I like to be explicit when I'm writing to places. I don't like to have defaults like this. So, I should have had it specified right here where we're writing it. It should have been something like data engineering.end-to-end, dot, and then the table name. Just to be more explicit. And we should have done this in basically all the Python files within the ETL pipeline. Totally fine though. Not a massive deal. Just something to think about. Now, if we come back to this catalog and look at this, we can go to the silver transactions clean. This is going to be our cleaned data. Let's just go ahead and run this real quick so we can look at that sample data. So, now this is our clean data. This product name looks much better. Looks really good. It did a few other really small things in here, but this is the main one that we're looking at. If we go back to the bronze transactions and look at our sample data, this is our bronze table. This looks terrible. So, obviously, it did a really good job cleaning it. And then we'll look at the gold daily transactions summary. This is looking at the transactions, just grouping them and aggregating the data. And this is great for a gold table that we're going to be using for some metrics or whatever we want to use it for. So, all of this looks really good. Now, in our silver transactions clean, at least in the sample data, we don't have a hundred. So, let's actually create a notebook with this. And let's run this query. So, now we can see we have 150 rows. And let's add code. And let's copy this.
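The point about being explicit, instead of leaning on the default catalog and schema, can be sketched with a small helper that always builds the full three-level name and a row-count query from it. This is just an illustration: `qualified_name` and `count_query` are hypothetical helpers, and the `data_engineering`, `end_to_end`, and table-name spellings are assumed from how they are spoken in the video.

```python
def qualified_name(catalog, schema, table):
    """Return a backtick-quoted three-level name: `catalog`.`schema`.`table`."""
    parts = (catalog, schema, table)
    for part in parts:
        # Reject empty parts and embedded backticks so the quoting stays valid.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


def count_query(catalog, schema, table):
    """Build a row-count query against an explicitly named table."""
    return f"SELECT COUNT(*) AS row_count FROM {qualified_name(catalog, schema, table)}"


# Assumed spellings of the catalog, schema, and tables from the video.
CATALOG, SCHEMA = "data_engineering", "end_to_end"
for table in ("silver_transactions_clean", "gold_daily_transactions_summary"):
    print(count_query(CATALOG, SCHEMA, table))
```

In a Databricks notebook you could pass each generated string to `spark.sql(...)` and display the result. The same explicit name is what you would want wherever the pipeline's Python files write a table, so nothing lands in `workspace.default` by accident.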
And instead of the transactions clean, we'll say, let's go see what that table's called. It's not gold customers. I should have kept it up over here. Let's go back to our catalog. See, when I start messing with my systems, I start getting messed up. It's gold daily transactions summary. I could have copied it from somewhere else, but I'm going to put it right here so that we can look at this. And what we're going to do now is go drop that other file into the S3 bucket, so that we can see when it gets updated and make sure that the new data gets in there and gets cleaned. So, let's come over here. We're going to upload, and let's click on add files. Now, we're going to come here. We have that 1-27 file. That's the new one that we didn't have before. Let's upload this and put it into our S3 bucket. And now we have 06, 13, 20, and 27 all in this S3 bucket. So, now I'm going to literally just let this wait. Let's come over here. And right here, this is going to kick off in probably 5 minutes or so. I'm just going to let it run. This is going to kick off. And then you will see that this job to run the end-to-end pipeline will automatically kick off as well once that table is updated. So, let's just be patient. Let's just wait. I'm going to skip you ahead. And you will see this running in just a little bit. All right. Now, you can see that this has kicked off. Looks like it is running. When this process is finished, it's going to update that table with the new data, which is going to trigger this job right here based on this table update on data engineering.end-to-end.transactions. And that should start any second here. And it looks like that is working. We can see this one running. And since it is literally running this pipeline, we can also see that this one is going to start running as well. It's just spinning up the compute so that it can run properly. Let's go ahead and let this run.
And then we're going to check in our queries if everything actually went through properly. All right. It looks like this job is still spinning, but the pipeline was kicked off successfully. It looks like it ran with no issues, which is exactly what we want. And now this job is done. So, now our entire process is complete. And it is going to keep doing that every 30 minutes. Every 30 minutes from now until I stop the job or stop this pipeline from running, it is going to kick this off. It's going to kick off the job and the pipeline. Every 30 minutes. Of course, I'm going to stop that because that's nuts to keep running. But let's come back. Let's go to our workspace. And I think it's this one. Let's go take a look. Yeah. So, right now this shows the 150 rows from before. Let's go ahead and run this. We should see 200 rows of data. And there we go. And let's just make sure it's all cleaned properly in that product name. Looks great. And let's come down here and make sure that this gets updated. We have 21 rows, but more than that, it's about the data because this is aggregated. So, let's go ahead and run this as well. And we have 28 rows. That's just another week's worth of data. And these numbers actually look basically the same, but we have this new week's worth of data in here that we didn't have before. So, that is the entire end-to-end project. It really brings everything together that we've been working with in the past several lessons into one final project. And I hope you were able to follow along. If you didn't follow along and you just watched this video to the end, I highly recommend using the free edition. I will have a link in the description. You can try all this out completely for free. You don't even have to enter a debit card or credit card, which I love. So, you can just use this. And it is an amazing platform to try out. I highly recommend it. But with that being said, thank you guys so much for watching. I hope you liked this video.
I hope you learned something in this entire series. If you did, be sure to like and subscribe. I'll see you in the next lesson.
