# Job Orchestration in Databricks | Data Engineering in Databricks

## Metadata

- **Channel:** Alex The Analyst
- **YouTube:** https://www.youtube.com/watch?v=p4pdx6iWotU
- **Date:** 14.04.2026
- **Duration:** 12:49
- **Views:** 3,526

## Description

In this series we are going to dive into the Data Engineering side of Databricks!

This video covers orchestrating jobs to automate our data pipelines.

Get the Data Here: https://github.com/AlexTheAnalyst/DatabricksSeries/tree/main/Data%20Engineering

Try out Databricks Free: http://signup.databricks.com/?provider=DB_FREE_TIER&utm_source=youtube&utm_medium=video&utm_campaign=AlextheAnalystDE

____________________________________________ 

RESOURCES:

💻Analyst Builder - https://www.analystbuilder.com/

📖Take my Full MySQL Course Here: https://bit.ly/3tqOipr
📖Take my Full Python Course Here: https://bit.ly/48O581R
📖Practice Technical Interview Questions: https://bit.ly/46pDqqL

Coursera Courses:
Google Data Analyst Certification: https://coursera.pxf.io/5bBd62
Data Analysis with Python - https://coursera.pxf.io/BXY3Wy
IBM Data Analysis Specialization - https://coursera.pxf.io/AoYOdR
Tableau Data Visualization - https://coursera.pxf.io/MXYqaN

*Please note I may earn a small commission for any purchase through these links - Thanks for supporting the channel!*
____________________________________________ 

BECOME A MEMBER - 

Want to support the channel? Consider becoming a member! I do monthly livestreams and you get some awesome emojis to use in chat and comments!

https://www.youtube.com/channel/UC7cs8q-gJRlGwj4A8OmCmXg/join
____________________________________________ 

Websites: 
💻Website: AlexTheAnalyst.com
💾GitHub: https://github.com/AlexTheAnalyst
📱Instagram: @Alex_The_Analyst
____________________________________________

*All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for*

## Contents

### [0:00](https://www.youtube.com/watch?v=p4pdx6iWotU) Intro

What's going on everybody? Welcome back to another video. Today, we're going to be orchestrating and automating our ETL pipelines in Databricks. Now, in the last two lessons, we've been building out our ETL pipeline. We've been writing all of our code and getting everything set up. But, once we actually have everything set up, then we need to automate this process so that we don't have to manually go in and run the code ourselves. Luckily, Databricks has this already built out for us. It is called a job. And so, we're going to jump into Databricks. We're going to create our own custom job. We're going to see all the small things that you need to do in order to create this automation. Now, in

### [0:37](https://www.youtube.com/watch?v=p4pdx6iWotU&t=37s) Lesson Recap

our last lesson, we built out this bronze to silver to gold ETL pipeline. And we're basically creating two separate tables, this s3_clean silver and then this insights gold. Those are our silver and our gold tables after they're transformed and we find our business insights. Now, just for demonstration purposes, I also kept our regular code in here as well. We have this bronze to silver. Then, we have another notebook for silver to gold. Now, these are just regular notebooks in Databricks, but I do want to show you how you can use these within a job as well. But, we have this bronze to silver transformation. You can see it in a pipeline. And then, if we just go to our bronze to silver, this is just a regular notebook. Now, in order to create our job, let's come right down here. We're going to go to runs. We're going to come over to job, and this is "orchestrate notebooks, pipelines, queries, and more." So, let's come in

### [1:27](https://www.youtube.com/watch?v=p4pdx6iWotU&t=87s) Job UI

here. Now, this is a new UI for us. And what you can do here is orchestrate the different steps that you want within your job. If we click right down here, we can see all the things that we can do. We can create ingestion pipelines or we can use existing ones. We can come down here and we can run notebooks, Python files, SQL queries, SQL files. And we have some more advanced things right down here, like if/else conditions, or you can create triggers from another job. And then, we also have this ingestion and transformation. And these are really useful because if you have an ingestion pipeline, an ETL pipeline, or a database table sync, then you can just use those that you've already created. Now, we've created an ETL pipeline. Let's go ahead and click on this ETL pipeline. We're going to come down here and we're going to click on this bronze to silver gold ETL pipeline. Now, I'm just going to call this bronze to silver to gold. Keep it simple. And all we would need to do is create this task. Now, of course, that would be a little too simple, right? But, this is as simple as it can get for any type of pipeline orchestration that you're trying to do. Oftentimes, when I'm creating entire pipelines and there are a lot of different steps, I package everything into an ETL pipeline and then I just place it in here. And then, what I'll do is I'll come over here to schedules and triggers. Now, we'll look at that in just a second. Really quick, we can also trigger a full refresh on this pipeline. So, we can click on this. We can also add notifications if you

### [2:56](https://www.youtube.com/watch?v=p4pdx6iWotU&t=176s) Notifications

want to send this notification when it kicks off or when it finishes. We can also look at retries. Now, this is really important because sometimes you are going to have things that fail for any number of reasons. Maybe you're trying to run this transformation, but the data hasn't all imported yet, or there's some connection issue, and that causes it to fail. You'd want to retry maybe an hour later or on a different day. And so, you can come in here and you can say, "Okay, I want to try this a ton of times. Let's try it 30 total times, and every single time we're going to wait maybe 30 or 40 minutes between each try, and then it'll keep trying until it is successful." Again, with this, you can notify yourself and make sure that you know what's happening, especially if this is a really important pipeline within your company. It is important to have these things set up so you don't have to manually go in there and see that it failed, you know, last night, and you just never got a notification and it never tried again. So, this would absolutely be something that you'd want to do. And then, you have metric thresholds. You
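The pipeline task, retry, and notification settings described above can also be written down as a job task definition. Here is a minimal sketch as a plain Python dictionary, assuming the field names used by the Databricks Jobs REST API (`pipeline_task`, `max_retries`, `min_retry_interval_millis`, `email_notifications`). The pipeline ID and email address are hypothetical placeholders, not values from the video, and the field names are worth checking against the current API docs.

```python
import json

# Hypothetical placeholders, not values from the video.
PIPELINE_ID = "<your-pipeline-id>"
ALERT_EMAIL = "data-team@example.com"

TOTAL_ATTEMPTS = 30       # "30 total times" from the walkthrough
RETRY_WAIT_MINUTES = 30   # wait between attempts

pipeline_task = {
    "task_key": "bronze_to_silver_to_gold",
    "pipeline_task": {
        "pipeline_id": PIPELINE_ID,
        "full_refresh": False,  # True would request a full refresh of the pipeline
    },
    # Retries are counted after the first attempt, so 30 attempts means 29 retries.
    "max_retries": TOTAL_ATTEMPTS - 1,
    "min_retry_interval_millis": RETRY_WAIT_MINUTES * 60 * 1000,
    # Notify when the task starts, succeeds, or fails.
    "email_notifications": {
        "on_start": [ALERT_EMAIL],
        "on_success": [ALERT_EMAIL],
        "on_failure": [ALERT_EMAIL],
    },
}

print(json.dumps(pipeline_task, indent=2))
```

Keeping the attempt count and wait time as named constants makes it easy to see how the UI values translate into the millisecond and retry-count fields.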

### [4:00](https://www.youtube.com/watch?v=p4pdx6iWotU&t=240s) Metrics

can set these, especially for something like a run duration. If you know this should take 5 minutes at most, you can set a timeout threshold or a warning threshold at maybe 30 minutes so that it isn't just going to keep running, because sometimes it gets stuck in these loops and keeps trying, and it's going to run forever and cost a lot of money, and you don't want that to happen. So, these are all really important things to think about when you
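As a rough illustration of the duration thresholds above, the sketch below builds the two related task fields: a hard timeout and a run-duration warning. It assumes the Jobs REST API field names `timeout_seconds` and `health.rules` with the `RUN_DURATION_SECONDS` metric. The helper name `duration_limits` and the 60-minute timeout are illustrative assumptions; only the 30-minute warning comes from the walkthrough.

```python
def duration_limits(warning_minutes: int, timeout_minutes: int) -> dict:
    """Build duration-related task fields (assumed Jobs API field names)."""
    if warning_minutes >= timeout_minutes:
        raise ValueError("the warning should fire before the timeout")
    return {
        # Hard stop: the run is cancelled once it exceeds this duration.
        "timeout_seconds": timeout_minutes * 60,
        # Soft limit: raises a duration warning without stopping the run.
        "health": {
            "rules": [
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": warning_minutes * 60,
                }
            ]
        },
    }


# Warn at 30 minutes; the 60-minute timeout is an illustrative choice.
limits = duration_limits(warning_minutes=30, timeout_minutes=60)
print(limits)
```

These fields would be merged into a task definition alongside its retry and notification settings.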

### [4:20](https://www.youtube.com/watch?v=p4pdx6iWotU&t=260s) Schedule Triggers

are actually creating these jobs. Now, let's come back here to schedules and triggers. For something like this, when you've done almost all the work in an ETL pipeline, you are going to want to schedule or trigger this most of the time. Now, for something like this pipeline, what we've done is we've extracted data out of an S3 bucket. What we would want to do is probably set a trigger for this. Now, what we need to do is we need to create this task first so that it's saved in there. And then, let's say this is our entire job. It's a very simple one. But, now we can come in here and we can add a trigger. There are

### [4:53](https://www.youtube.com/watch?v=p4pdx6iWotU&t=293s) Adding a Trigger

several different types of triggers. One, we have a schedule, which is as simple as it sounds. We are just going to schedule this. Right now, it'll be active. You can pause it. We're just going to schedule this and we'll say every 1 week. And so, every 1 week, we're going to save this and this is going to run every week. So, that's super simple. Now, let's delete this and let's add another trigger. We can also
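The simple "every 1 week" schedule above corresponds to what the Jobs REST API calls a periodic trigger. A minimal sketch, assuming the `trigger.periodic` shape with `interval` and `unit` fields and the `PAUSED`/`UNPAUSED` status values; the helper name `periodic_trigger` is made up for this example.

```python
ALLOWED_UNITS = {"HOURS", "DAYS", "WEEKS"}


def periodic_trigger(interval: int, unit: str, paused: bool = False) -> dict:
    """Build a periodic trigger block (assumed Jobs API field names)."""
    unit = unit.upper()
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {sorted(ALLOWED_UNITS)}")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return {
        "trigger": {
            # "Active" in the UI corresponds to UNPAUSED here.
            "pause_status": "PAUSED" if paused else "UNPAUSED",
            "periodic": {"interval": interval, "unit": unit},
        }
    }


weekly = periodic_trigger(1, "weeks")
print(weekly)
```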

### [5:16](https://www.youtube.com/watch?v=p4pdx6iWotU&t=316s) Scheduling a Trigger

schedule it. We can go a little bit more advanced and we can schedule it at a very specific day and time. Now, this is what I usually do because there are certain cadences and timing to things that I really like. For example, at a previous job, we wanted the data to be as fresh as possible because we actually had it refresh often, like every 10 minutes. And so, what we were doing was we were trying to run it as soon as we could in the morning to where it would still run, but it would give us the freshest set of data by about 8:30 in the morning. So, we would kick off this job at like 7:45 so that the freshest data would be available by 8:30. This is more advanced. You don't have to do this, but this is a really useful thing to do. The next thing that you can do, or the next type of trigger, is a file arrival. So, if we click on file arrival, we're going to say when a file arrives at this location, kick off this job and run everything within it. Now, for our process, this would be like our S3 bucket, and we can go and look at our S3 bucket. If a new file gets dropped in here or this gets updated, then we may trigger this job and it will run. And of course, we have advanced settings as well where we can wait a minimum time between triggers, because what if you're uploading a lot of documents at the same time? You don't want it to trigger 20 times because you just dropped 20 different files in there one at a time. You'd want to wait for all these files to get in there. So, that is absolutely an option. And if we go back, we also
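The file arrival trigger and its advanced settings could be sketched like this. The block assumes the Jobs REST API field names `file_arrival`, `url`, `min_time_between_triggers_seconds`, and `wait_after_last_change_seconds`. The S3 path is a placeholder and the wait times are illustrative values, not settings shown in the video.

```python
def file_arrival_trigger(url: str,
                         min_minutes_between_triggers: int = 10,
                         wait_minutes_after_last_change: int = 2) -> dict:
    """Build a file-arrival trigger block (assumed Jobs API field names)."""
    if not url.endswith("/"):
        url += "/"  # treat the location as a folder, not a single file
    return {
        "trigger": {
            "pause_status": "UNPAUSED",
            "file_arrival": {
                "url": url,
                # Avoid 20 runs when 20 files land one after another.
                "min_time_between_triggers_seconds": min_minutes_between_triggers * 60,
                # Wait for the upload burst to settle before starting a run.
                "wait_after_last_change_seconds": wait_minutes_after_last_change * 60,
            },
        }
    }


# Placeholder location: substitute your own bucket and folder.
trigger = file_arrival_trigger("s3://<your-bucket>/<landing-folder>")
print(trigger)
```

The two waits address the "20 files at once" problem from the walkthrough: one spaces out consecutive runs, the other delays a run until files stop arriving.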

### [6:39](https://www.youtube.com/watch?v=p4pdx6iWotU&t=399s) Table Update

have a table update. So, this would trigger when data is updated in a table. Now, for our use case, this may work because we have S3 data. We're bringing it into our bronze table. So, I can come in here and I can say when this table, and I would just specify that table name that we've been using, when this bronze table gets updated from that S3 bucket, then kick off this job. And of course, this ETL pipeline takes that bronze data. We transform all the data. We create our gold tables, and then we have all that data sitting there. So, this might be a really good use case. We have some advanced options down here, minimum time between triggers and wait after last change, just like we did before, because sometimes data gets updated continuously and so it might trigger it many times. These are things that you should test and try out within your pipelines just to make sure you get them right. Now, let's cancel out of this and let's actually get rid of this entirely. Let's actually come here and we're going to go back to our runs. Or sorry, back to our jobs. And I want to show you one more thing within here that might be really useful. Now, we just kind of came down here and we pulled in this ETL pipeline. But, let's actually pull in and run a notebook. So, we're
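For the table update trigger discussed above, a comparable sketch follows. It assumes the Jobs REST API uses a `table_update` block with `table_names`, a `condition` of `ANY_UPDATED` or `ALL_UPDATED`, and the same two wait fields as the file arrival trigger. The three-part table name is a placeholder, since the video does not spell out the bronze table's full name.

```python
def table_update_trigger(table_names: list,
                         require_all: bool = False,
                         min_seconds_between_triggers: int = 600,
                         wait_seconds_after_last_change: int = 120) -> dict:
    """Build a table-update trigger block (assumed Jobs API field names)."""
    if not table_names:
        raise ValueError("at least one table name is required")
    return {
        "trigger": {
            "pause_status": "UNPAUSED",
            "table_update": {
                "table_names": list(table_names),
                # Fire when any watched table changes, or only once all have.
                "condition": "ALL_UPDATED" if require_all else "ANY_UPDATED",
                "min_time_between_triggers_seconds": min_seconds_between_triggers,
                "wait_after_last_change_seconds": wait_seconds_after_last_change,
            },
        }
    }


# Placeholder name: substitute the bronze table loaded from the S3 bucket.
bronze_trigger = table_update_trigger(["<catalog>.<schema>.<bronze_table>"])
print(bronze_trigger)
```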

### [7:53](https://www.youtube.com/watch?v=p4pdx6iWotU&t=473s) Notebook

going to specify our notebook. We're just going to do this as our bronze to silver. And this is a notebook. It's within our workspace, not a Git provider. And let's select our notebook. So, we're going to come in here. We're going to do bronze to silver. Let's confirm this. And you'll notice we have

### [8:07](https://www.youtube.com/watch?v=p4pdx6iWotU&t=487s) Notebook Parameters

a lot of different options in here. Some are similar, right? We have retries, we have notifications, and we have metric thresholds, but we also have parameters. These are parameters that you can pass down to the task. Because this is just a notebook, it doesn't have all that built-in stuff that we were talking about in the last lesson within the ETL pipeline. So, you do need to configure this a little bit more within a job. So, we can add these parameters, where we create these kind of key-value pairs that we pass into a notebook. But, let's come in here. Let's create this task. And now, we're going to add in another task. So, let's come here and pick notebook. And this is going to be our silver to gold. Now, these two tasks, and let's actually name this. These two tasks that we've created, these two notebooks, do the exact same thing as our pipeline. But, I wanted to show you this because it does give us some more information when we're actually building out these jobs. So, we specified our path. We have our compute as serverless, but now we have something called a dependency or a dependency chain. This right here, this line, is a dependency. With what we have right now, this silver to gold is completely dependent on this bronze to silver, which means if we get this data in and this bronze to silver does not run correctly, then this silver to gold is never going to run. And in this use case, that's perfectly fine because this relies heavily on this bronze to silver. But, there are going to be use cases where that is not the case, where we would not want that to be, you know, a dependency. We wouldn't have to rely on it. Or, we also have an option right down here, run if dependencies, and we have a lot of different options. So, right now, all succeeded means this has to run properly in order for this to run.
But, there are going to be cases when you create these dependency chains where you're like, it doesn't matter if this one succeeds, we just want it to run after this one runs, whether it fails or not. And so, for that one, you can come in here and choose at least one succeeded, none failed, all are done, at least one failed, or all failed. You can specify whichever option you need. For us, we would want to keep this all succeeded because if this one fails, we don't actually create the silver tables that are needed in order to run this one. So, that is pretty important. We can come down here, and we can create this task, and now we have this job that we've created, and we can run it now, or of course, we could add in our trigger.
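The two-notebook job and its dependency chain can be sketched as follows. The task list assumes the Jobs REST API field names `notebook_task`, `base_parameters`, `depends_on`, and `run_if`. The notebook paths and the `run_date` parameter are hypothetical. The `should_run` helper is a deliberately simplified model of the "run if dependencies" options: it only knows `"success"` and `"failed"` upstream results and ignores skipped or excluded tasks, which the real service also accounts for.

```python
# Hypothetical notebook paths and parameter; substitute your own.
tasks = [
    {
        "task_key": "bronze_to_silver",
        "notebook_task": {
            "notebook_path": "/Workspace/Users/<you>/bronze_to_silver",
            "source": "WORKSPACE",  # workspace notebook, not a Git provider
            "base_parameters": {"run_date": "2026-04-14"},  # key-value pairs
        },
    },
    {
        "task_key": "silver_to_gold",
        "notebook_task": {
            "notebook_path": "/Workspace/Users/<you>/silver_to_gold",
            "source": "WORKSPACE",
        },
        # The line drawn between the two tasks in the UI.
        "depends_on": [{"task_key": "bronze_to_silver"}],
        "run_if": "ALL_SUCCESS",  # "All succeeded" in the UI
    },
]


def should_run(run_if: str, upstream: list) -> bool:
    """Simplified check: each upstream result is 'success' or 'failed'."""
    succeeded = [result == "success" for result in upstream]
    failed = [result == "failed" for result in upstream]
    rules = {
        "ALL_SUCCESS": all(succeeded),
        "AT_LEAST_ONE_SUCCESS": any(succeeded),
        "NONE_FAILED": not any(failed),
        "ALL_DONE": True,  # runs once upstream tasks finish, pass or fail
        "AT_LEAST_ONE_FAILED": any(failed),
        "ALL_FAILED": all(failed),
    }
    return rules[run_if]


# With "All succeeded", a failed bronze_to_silver blocks silver_to_gold.
print(should_run("ALL_SUCCESS", ["failed"]))
# With "All done", silver_to_gold would run regardless.
print(should_run("ALL_DONE", ["failed"]))
```

Inside the notebook, a parameter passed this way is typically read with `dbutils.widgets.get("run_date")`, where `run_date` is whatever key was defined on the task.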

### [10:41](https://www.youtube.com/watch?v=p4pdx6iWotU&t=641s) Triggers

Now, typically with something like this, it could go either way. You could have it on file arrival, table update, or a schedule. It really is just very dependent on your workflow and how you want this to trigger. For most of these, you're going to have some type of trigger. Let's just set it on a schedule, and let's go to advanced, and we're going to set this for every week, and let's do this on a Monday, and let's do it at 7:45 because that's when I used to run some of ours at a previous job. So, I'm going to do it at 7:45 every Monday morning. Let's go ahead and schedule
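The advanced schedule above (every Monday at 7:45) is stored as a Quartz cron expression in a job's `schedule` block. A small sketch, assuming the Jobs REST API field names `quartz_cron_expression`, `timezone_id`, and `pause_status`; the helper name `weekly_quartz_cron` and the `UTC` time zone are assumptions made for this example, since the video does not show the time zone.

```python
DAY_CODES = {
    "sunday": "SUN", "monday": "MON", "tuesday": "TUE", "wednesday": "WED",
    "thursday": "THU", "friday": "FRI", "saturday": "SAT",
}


def weekly_quartz_cron(day: str, hour: int, minute: int) -> str:
    """Return a Quartz cron expression for one run per week."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Quartz field order: seconds minutes hours day-of-month month day-of-week
    return f"0 {minute} {hour} ? * {DAY_CODES[day.lower()]}"


schedule = {
    "quartz_cron_expression": weekly_quartz_cron("Monday", 7, 45),
    "timezone_id": "UTC",  # assumption: pick the zone your team works in
    "pause_status": "UNPAUSED",
}

print(schedule["quartz_cron_expression"])  # 0 45 7 ? * MON
```

The `?` in the day-of-month position tells Quartz that the day is determined by the day-of-week field instead.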

### [11:11](https://www.youtube.com/watch?v=p4pdx6iWotU&t=671s) Rename Job

this, and now we've updated this job, and now we can also rename this. I'm going to call this our silver to gold job. So, now if we go back to our jobs and pipelines, we have our silver to gold job right here. This was the pipeline that we built out in the last lesson, and this is going to be orchestrated and scheduled to run this pipeline. Well, actually, we used the notebooks instead of the pipeline for that last example, but we're going to be running that code to actually create and update those tables. So, that is how we create a job in Databricks. This is extremely, extremely useful. Again, like we did just a little bit ago for our silver to gold job, and let's go into

### [11:53](https://www.youtube.com/watch?v=p4pdx6iWotU&t=713s) Outro

the tasks. If it's a really small transformation, and maybe it's just for me, I'll just do it like this where I just have the notebooks. But, if it's a larger transformation, especially if there's a lot of dependencies, complexity, I will use an ETL pipeline. So, get in here, mess around with this, try this out because this is super fun to play around with and kind of get all those dependency chains going, and getting the ETL pipelines where they're triggering off of each other or when a file is updated. This is really cool stuff to mess around with, and it's awesome to use within Databricks. I really hope that this was helpful. If you haven't, be sure to create a free Databricks account. I will have a link down in the description, and you can try all of this completely for free. If you like this video, be sure to like and subscribe, and I'll see you in the next lesson.

---
*Source: https://ekstraktznaniy.ru/video/49810*