Full End-to-End Data Engineering Project in Databricks

23:51

Full End-to-End Data Engineering Project in Databricks

Alex The Analyst 21.04.2026 9 813 просмотров 303 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

In this series we are going to dive into the Data Engineering side of Databricks! In this video we are building an end to end ETL pipeline. Get the Data Here: https://github.com/AlexTheAnalyst/DatabricksSeries/tree/main/Data%20Engineering Try out Databricks Free: http://signup.databricks.com/?provider=DB_FREE_TIER&utm_source=youtube&utm_medium=video&utm_campaign=AlextheAnalystDE ____________________________________________ RESOURCES: 💻Analyst Builder - https://www.analystbuilder.com/ 📖Take my Full MySQL Course Here: https://bit.ly/3tqOipr 📖Take my Full Python Course Here: https://bit.ly/48O581R 📖Practice Technical Interview Questions: https://bit.ly/46pDqqL Coursera Courses: Google Data Analyst Certification: https://coursera.pxf.io/5bBd62 Data Analysis with Python - https://coursera.pxf.io/BXY3Wy IBM Data Analysis Specialization - https://coursera.pxf.io/AoYOdR Tableau Data Visualization - https://coursera.pxf.io/MXYqaN *Please note I may earn a small commission for any purchase through these links - Thanks for supporting the channel!* ____________________________________________ BECOME A MEMBER - Want to support the channel? Consider becoming a member! I do Monthly Livestreams and you get some awesome Emoji's to use in chat and comments! https://www.youtube.com/channel/UC7cs8q-gJRlGwj4A8OmCmXg/join ____________________________________________ Websites: 💻Website: AlexTheAnalyst.com 💾GitHub: https://github.com/AlexTheAnalyst 📱Instagram: @Alex_The_Analyst ____________________________________________ *All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for*

Оглавление (5 сегментов)

Segment 1 (00:00 - 05:00)

What's going on everybody? Welcome back to another video. Today, we're building our full data engineering project in Databricks. Now, if you haven't been following along in the past three videos in this series, we've covered several things. One, we've just learned about ingesting data. Then after that, we looked at ETL pipelines, and then we looked at creating a job to orchestrate all these things and kind of automate the process. In this video, we're going to be putting all of that together into one. We're going to add some things that we didn't cover in previous lessons to make it a little bit more advanced, but it's going to cover a lot of the same concepts. Let's not waste any time. Let's jump right on to my screen and get started. Now, before we actually jump into Databricks, what we're going to be working with is that same S3 bucket that we created earlier, but I created this transactions folder. And that is going to be an important piece of this process. It's something that we touched on in a previous video, but we are actually going to be doing it in this lesson. So, we used this users_dirty. csv in this bucket, but inside of this transactions, we have three separate transaction files, and we'll actually be adding another one later on to show how the entire process works. So, I'm going to have these and the other file down in the description. You can just download those from GitHub, but we will need those. So, we're just going to start off with these three, the 1_6, 1_13, and 1_20. Now, really quick, just to show you what data we're working with, this is our data. Let me actually zoom in just a little bit. Uh the data itself is not as important for this specific project just because we're more focused on the process of building the pipeline within Databricks, but within the project, we will be cleaning this data a little bit because this is just a horrible column. Uh I think whoever, you know, was collecting this data just left this free text or something for people to just put whatever they wanted in there. Uh not a good system, but that is the kind of data that we're going to be working with. So, let's come up here. Let's get out of this. We don't need to save it. Now, let's come up to our Databricks. Now, in our previous lesson, this is what we built. We built this uh pipeline right here, bronze to silver to gold ETL pipeline. And then in the very last lesson, we created the silver to gold job, which basically scheduled this and automated this, and it ran successfully, and everything was great. Now, what we're going to be doing is it in a similar fashion, but we're covering some new things. All you need to do, and I actually have another tab for this cuz I don't want to have to keep going back and forth when we're building this out. I created this end-to-end schema within our data engineering catalog. You don't have to do this. You can put this wherever you want. I just did this as kind of where we'll be building things out. So, I'll just come back to this as we start adding in new tables, as we start creating this stuff, I'm going to come back to that. Now, this is where we're going to be doing a lot of our work on this uh tab right here. So, let's come over to data ingestion. Let's go over to our Amazon S3. Now, if you haven't already, in a previous lesson, I think the second video, we connected to an S3 bucket. So, if you don't know how to do that, then come over here and do this. Now, we used it for the one time cuz all we used was this users_dirty. csv, but in order to schedule this data ingestion, we're going to use a folder. So, we have this transactions folder right here. So, we're going to click on this. transactions, and we have those three separate files in there. And we can schedule when we want to bring those in. Now, we can be very specific or pretty laid-back. Uh so, for example, if we want to do, you know, once a day, we can specify what time of day we want that, and that's similar to a job, so it's not that crazy. Now, what we're going to be actually doing is we're going to schedule this for basically every 30 minutes. And what we're going to do is we're going to build this entire thing out, and what our trigger is going to be inside of our job is when a table gets updated. So, then we're going to drop a file in our S3 bucket, and when this brings it in at that 30-minute point, it's then going to refresh, kick off the job, which runs our ETL pipeline. We should be able to do all this within 30 minutes for sure. So, I'm going to say every 30 minutes, and we'll just set it at 0 minutes past the hour, which means at basically the top of the hour. Now, this is my time zone, but you can set it to whatever time zone you want. Now, let's go ahead and preview this table. It's going to start up our compute, then it's going to give us uh basically what we need in order to create this table, which is our preview, and then where we want to place it along with the table name. Now, an important thing to note from just those three files is there's only 50 rows of data in each one. So, if we come down here, we got all the way up to 100, so we at least know two of those files are coming in just from this preview. We're going

Segment 2 (05:00 - 10:00)

to keep this as the transactions, but for the schema, we're going to add the end-to-end, which is the custom one that we created for this project. So, we have transactions right here. Let's go ahead and create the streaming table. So, now this table has been created. Let's just look at a sample of this data. It should show us enough to be confident all three got in, but then we can just also run a query, and that's probably fine. In fact, instead of waiting, uh never mind, we got them all in. I was going to say we don't have to wait on this. We could just run a query in like a notebook or a SQL editor, but we have all 150, so that's all three files. So, now that we know we have all three of our files in cuz it's 50 each, it's going to be 150 rows. Now that we know those are in, we can start building things out. Now that we know that it's all in there, what we can do is let's come over here to our jobs and pipelines. Now, this is where we were before. We only had these two things. We had a pipeline, and now we had a job. And now we have another pipeline, and we didn't build this ourselves. This was built automatically, and if we come in here, we can get a little bit of information on this. This is our streaming pipeline that we created to put into this end-to-end transaction. So, this is that streaming table that we created. And so, we don't have to technically manage this. It's going to be managed by Databricks itself. And so, this is just something to note that when we did that, we did create its own pipeline for this. Now, what we need to do is we need to create an ETL pipeline. So, let's come in here. We're going to click on the ETL pipeline. This new UI pops up right away. We don't have the options that we had before in previous lessons. Um but now what we're going to do is we're going to start building this out with Genie code. Now, I could absolutely just write all this out, and this would be like an hour and a half video, or we can have Genie code write it out, which I highly recommend trying it out and starting to use these tools cuz they really speed up your work. And if you already know how to program, if you know how to code, this is going to be a huge boost to your productivity. And so, what we're now going to do is use Genie code right down here, basically tell it what we want to build. And we're going to do a few things. One, we want to build that bronze to silver, which is basically our raw data, which is that transactions table, to a silver table, which is where the data is cleaned, to then a gold table, which is what we'd use for like a production level uh product or production level analysis or whatever that might be. So, we can come in here, and we can use that app, and it is prompting us to do that. And if we come in, and we can say data engineering. end-to-end. And I'll just put it like that, so it's looking kind of at that schema. I'm just going to say uh for the transactions table, I want to create a bronze-to-silver transformation on this raw data. I want you to clean this data set. I'm just going to leave it really open-ended just to see what it does. Maybe it catches something outside of that column. I don't think it will, but uh let's just see what it does. Then we are going to create a silver-to-gold transformation. And you can do this in the same notebook or separate notebooks. It may also do that for you with Genie code, but you can be really specific, and it's honestly pretty great at what it does. And I want to track daily transactions in that gold table. So, I'm going to give it just this to work on. It's going to take that. It's going to kind of create its logic. It's going to start writing everything out. Um I have found, this is not just me saying this, I genuinely love working in this system because Genie code is very good at understanding context and what you're trying to do and working with tables and just everything. And so, we're going to let this run for just a little bit. I'm going to come back. We'll take a look at what it said, and then we'll commit some code to start going on this ETL pipeline. All right, so it just finished. I haven't even really reviewed this cuz it only took, you know, 30 seconds, but it took a look at the data. Then it came down here and gave a proposed pipeline architecture. So, here's what we have. We have our bronze layer, which is just going to be our transactions. py. And this read in the data as is. So, it's going to recreate basically the raw data, which I'm totally fine with. It's not a big deal. Then for our silver layer, we have the silver transactions_clean. It's going to trim the white space, standardize capitalization, remove duplicate spaces, filter out null transaction IDs or negative quantities and amounts, and add data quality expectations. I think these are all perfectly reasonable things to do. Then we have our gold layer. This is going to be transformations gold daily transactions summary. py. So there's three different

Segment 3 (10:00 - 15:00)

files that it's going to create and it's going to aggregate some of this data into kind of these metrics right here. I think this all looks great. If there was something I wanted to change, I would just tell it, "Hey, let's do this instead. " So let's just say, "Go for it. " Start writing the code. My friend. It really is my friend at this point. I've been using it a lot. So let's let this run. Let's watch the code and then we will commit everything and then it probably will prompt us to do some type of dry run to make sure that there aren't any errors that we're just missing and then we will run the entire thing and start automating this with a job as well. It's still writing. That was like 10 seconds I stopped talking, but it's still writing everything. It's going to start organizing this. creating our. py files or just our Python files. I just I'm reading it as is. Um it's creating our Python files and then it's going to start writing the code in which we are then going to review, approve, and then run. You can see these things starting to pop up. So we have our code. We have our diff or you know, if we had code that it took out, it would also say the minus. Uh but we're just creating code right now. And so right here it's saying, "All right, do we want to try dry running this pipeline? just see if it works? " Um and of course we're going to do that in a second, but I'm going to go to each one just to kind of see what it's doing. It looks like this is our gold and we're just using a group by for this. Uh let's just see what it did for the data cleaning. So it looks like it is going to drop some stuff in here, but we are looking at some regex replace, which is great. Some trimming and proper case uh for a few other stuff. And this looks perfectly uh good to me. I have no problem with what it's doing. Again, this is all subject to be altered. If you want to change this or have it do other things or fix the code yourself, you absolutely can do that. Now all we're going to do is we're just going to accept this. And so we're going to allow this and it's going to run a dry. So we'll accept, review next, and accept. I didn't have I could have done that in a different way. But now we're going to try dry running this pipeline. Now what this does is it is not going to actually run through and run your code. It's doing a dry run. It's basically testing are there any big errors that we need to fix before you actually implement this into you know, whatever process you're doing so that we don't have issues right off the bat. It's going to run for just a little bit and then it'll tell us if there's any big issues. Um often times if you've never done this before, you shouldn't have any big issues, but you could get issues like "Oh, this table uh you don't have the permissions for this table. " Maybe you wrote something incorrectly or in this case uh you know, Genie code wrote something incorrectly that is not going to create the materialized view properly or you're pulling from a table that doesn't exist anymore. So there's lots of issues that could arise, but let's let this run. It shouldn't take very long. And just like that we did encounter a small issue. Um it's actually going to run. It'll probably fix this very easily. Um I am not exactly sure what the issue is here just glancing at it. But it looks like it's fixing that code for all of it. And let's go ahead and just accept that and let's try dry running this one more time. Now it looks like everything is running properly and this is really good. So what we can now do is we can rename this. So it's going to give us some feedback on that, but I'm going to rename this and I'm going to say this is our end to end uh ETL pipeline. And that's what we're going to name it. So we have our end to end ETL pipeline. And with this, if we come back here, obviously nothing has changed, right? This was just a dry run that we did. Now what we can do is we can actually run this pipeline and it will run everything. It's going to do all the transformations, all the things that we would want it to do and we should and we will do that in a little bit. Now what we want to do is we want to automate this process. All we have to do is we're going to come back here to not data ingestion into runs and let's get rid of this. Now we're going to create a job for this. So we're going to come in here and we're going to say we want our pipeline and if we come in here, we have our end to end ETL pipeline. That's the one we want. We're just going to call this um end to end ETL pipeline. Keep it simple. Now what we're going to do and you can always come in here and add notifications and retries and metric thresholds, which we covered in the last lesson. Now we're going to create this task, but now we're going to add this trigger right here. Now this trigger is going to be a table update. So what we want it to do is when new data is actually updated and brought into that table, we want this job to kick off so that it runs our entire ETL process to clean the data and put that new data into our you know, new tables that we're creating.

Segment 4 (15:00 - 20:00)

Now what we want to do is we want to say this table, when this table gets updated. So let's come up here and we're just going to copy this name to the clipboard and we're going to put it right down in here. We could also have typed it out. Um either one's fine, but I just wanted to copy it. So when this table gets updated by our S3 process, which we're running every 30 minutes, this is going to kick off the ETL pipeline, right? It's this right here. So now that we have that job updated and created, let's go back to our jobs and pipelines and now we have a few new things in here. So right here we have our end to end ETL pipeline. I should have named this job. Let's actually come in here really quick. I'm just going to come up here. I'm going to rename this. I'm going to say job to run end to end pipeline. And let's rename this. So we have our pipeline. We have our job to run the ETL pipeline and we have our transactions that's going to run 30 minutes on the minute. It looks like um it may have already run before. No, I think we're good. No, it did. It's already run twice. I think that's just because of when I set it to the zero time. Perfectly fine. Um but what we're going to do now is we are going to just check that this end to end ETL pipeline is working properly. It's going to create all of our tables. We're going to then write a query just to show that the data looks good and then we'll go drop our extra file in there and then we'll wait to have it update and the ETL pipeline bring in the data. Then our job is going to trigger and then it'll run our ETL pipeline to bring in and clean that new data as well. So let us run this pipeline. This is going to take just a little bit to actually run and then we'll go check the data in just a little bit. All right, so this looks like it worked properly. We have completed, and completed. Let us come up here and let's go back and let's refresh this. And it is possible that I put it in the wrong place and it totally is. I absolutely forgot to change that in the ETL pipeline. It is pointed at the workspace default. Let's actually go back and you know, this happens. We're going to edit this pipeline. So it is our default catalog that caused this issue. We have our default catalog and default schema as workspace and then default. Um you can change this. You don't have to, but you absolutely can. You also, if I'm being honest, I should have fixed this myself or caught this right away. I like to be explicit when I'm you know, writing to places. I don't like to have defaults like this. So I should have had it specified right here where we're writing it. Should have been like uh you know, data engineering. end to end. And then the uh table name just to be more explicit and we should have done this in basically all the Python files within the ETL pipeline. Totally fine though. Not a massive deal. Just you know, something to think about. Now if we come back to this catalog and we look at this, we can go to let's go to the silver transactions clean. This is going to be our cleaned data. Let's just go ahead and run this real quick so we can look at that sample data. So now this is our clean data. This product name looks much better. Uh looks really good. It did a few other really small things in here, but this is the main one that we're looking at. If we go back to the bronze transactions uh and look at our sample data. So this is our bronze table. This looks terrible. So obviously it did a really good job data cleaning it. Uh and then we'll look at the gold daily transactions summary. And this is looking at the transactions just grouping by and then looking at a lot of our data and this is great for like a gold table uh that we're going to be using for you know, some metrics or whatever we want to use it for. So all of this looks really good. Now in our silver transactions clean in our sample data, again, we at least in the sample, we don't have 100. Let's go and run this. So let's actually create a notebook with this and let's run this query. So now we can see we have 150 rows and let's add code and let's copy this and instead of the transactions clean, we'll say uh let's go see what that table's called. It's not gold customers. I should have kept it up over here. Let's go back to our catalog. See when I start messing with my systems, I start getting messed up. It's gold daily transactions summary. I could have gone and copied it somewhere else, Uh but I'm going to put it right here so that we can look at this. And what we're going to do now is we're going to go drop in that other

Segment 5 (20:00 - 23:00)

file into the S3 bucket so that we can see when it gets updated and to make sure that the new data gets in there and gets cleaned. So, let's come over here. We're going to upload and let's click on add files. Now we're going to come here and we have that 127. That's the new one that we didn't have before. Let's upload this and put this into our S3 bucket. And now we have 06 13 20 and 27 all in this S3 bucket. So, now what we're going to do is we're just going to I'm going to literally just let this wait. Let's come over here and right here, this is going to kick off in probably like 5 minutes or so. I'm just going to let it run. This is going to kick off and then you will see that this job to run the end-to-end pipeline will automatically kick off as well once that table is updated. So, let's just be patient. Let's just wait. I'm going to skip you ahead and you will see this running in just a little bit. All right. Now you can see that this is kicked off. Looks like it is running. When this process is finished is going to update that table with the new data which is going to trigger this job right here based on this table update on dataengineering. end-to-end. transactions and that should start any second here. And it looks like that is working. We can see this one running and since it is literally running this pipeline, we can also see that this one is going to start running as well. It's just spinning up the compute so that it can run properly. Let's go ahead and let this run and then we're going to see and check in our queries if everything actually went through properly. All right. It looks like this job is still spinning, but the pipeline was kicked off successfully. It looks like it ran with no issues which is exactly what we want and now this job is done. So, now our entire process is complete and it is going to keep doing that every 30 minutes. Every single 30 minutes from now until I stop the job or I stop this pipeline from running, it is going to kick this off. It's going to kick off the job. pipeline every 30 minutes. Of course, I'm going to stop that cuz that's nuts to keep running. But let's come back. Let's go to our workspace and I think it's this one. Let's go take a look. Yeah. So, now we have 150 rows. Let's go ahead and run this. We should see 200 rows of data. And there we go. And let's just make sure it's all cleaned properly in that product name looks great. And let's come down here and let's just make sure that this gets updated. We have 21 rows, but more than that it's about the data cuz this is aggregated. So, um let's go ahead and run this as well. And we have 28 rows. That's just another week's worth of data and these numbers are actually look basically the same, but we have this new week's worth of data in here that we didn't have before. So, that is the entire end-to-end project. It really brings everything together that we've been working with in the past several lessons into one final project and I hope you were able to follow along. If you didn't follow along, you just watched this video to the end, I highly recommend using the free edition. I will have a link in the description. You can try all this out completely for free. You don't even have to enter a debit card or credit card which I love. So, you can just use this and it is an amazing platform to try out. I highly recommend it. But with that being said, thank you guys so much for watching. I hope you liked this video. I hope you learned something in this entire series. If you did, be sure to like and subscribe. I'll see you in the next lesson.

Другие видео автора — Alex The Analyst

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник