# Data Engineering Best Practices: Idempotency

## Metadata

- **Channel:** StartDataEngineering
- **YouTube:** https://www.youtube.com/watch?v=SKtKoyVQHXs
- **Date:** 02.04.2026
- **Duration:** 5:40
- **Views:** 366
- **Source:** https://ekstraktznaniy.ru/video/52947

## Description

Code: https://github.com/josephmachado/data-engineering-course-sample/

This is a sample from my upcoming Data Engineering Course

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

In this chapter, we are going to look at idempotency. Let's dig into what we saw earlier in the rerun backfill chapter: we will almost always end up rerunning our pipeline, and we want to make sure that rerunning a pipeline will not create duplicate or partial data. Formalizing this, that's what's called idempotency: the idea that if you run the same code multiple times with the same input, the output should not change. There should not be duplicates or any partial data; it should remain the same as long as the input remains the same. When storing the output in an external data store, again, it should not be duplicated or have any leftover data from an older run. Mathematically, it's defined as f(f(x)) = f(x), meaning no matter how many times you apply the function f, which in this case is our pipeline, on the input x, the output should not change.

To create an idempotent pipeline, there are two main criteria you want to satisfy: one is atomicity, the other is no side effects. Atomicity refers to the principle that a pipeline should only create one table, so our script will only create one table. This can be violated, your pipeline script can create multiple tables, but it becomes incredibly difficult to maintain and debug over time. Secondly, no side effects means the output is just a table. Besides logs and exceptions, it shouldn't modify anything else. For example, changing a variable in some table might be okay when you are interacting with systems like logging or data quality logging, but it should not impact the state of any other pipelines. So those are the main criteria you want to be mindful of.

And also, you might have noticed we are using functional patterns where each function does one unit of work: in our case, extract, transform and load. This makes our code super easy to maintain, debug and change over time.
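
To make the definition concrete, here is a minimal sketch of the extract, transform and load split, using an in-memory SQLite database as a stand-in for the warehouse. The table names `customer` and `dim_customer`, the columns, and the name-cleaning logic are assumptions for illustration, not the course's actual code. The pipeline writes exactly one table, replaces its contents on every run, and touches nothing else.

```python
import sqlite3


def extract(conn):
    """Read the source rows; the only function that touches the input."""
    return conn.execute(
        "SELECT customer_id, name FROM customer ORDER BY customer_id"
    ).fetchall()


def transform(rows):
    """Pure function: the output depends only on the rows passed in."""
    return [(customer_id, name.strip().title()) for customer_id, name in rows]


def load(conn, rows):
    """Replace the contents of the single output table instead of appending."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM dim_customer")
        conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)


def run_pipeline(conn):
    load(conn, transform(extract(conn)))


# Hypothetical source data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(1, " ada "), (2, "grace")]
)
conn.commit()

run_pipeline(conn)
first = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
run_pipeline(conn)  # rerun with the same input
second = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()

assert first == second  # same input, same output, no duplicates
```

Because `transform` takes rows in and returns rows out, it can be tested without any database at all, for example `transform([(3, " bob ")])` returns `[(3, "Bob")]`.
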
These functional patterns also make our code really easy to test; the transform function especially is very easy to test because all it gets as inputs are data frames. They also represent the nature of data pipelines, where you get extract, transform and load. And if you are familiar with dbt, all it does is the transform part and it automates away the extract, and that's why it's very popular. And if you have individual functions, you can run them independently, and it's easier to run them in parallel with independent time ranges as inputs.

Let's go over an example. Let's look at the dim customer snapshot. In this example, we'll see that no matter how many times we run this dim customer snapshot pipeline with the same input of customer and customer address, the output will not change. That is primarily because our transform only depends on the inputs. It's not depending on any other system or any other data that can randomly change. It only depends on the input, and it always does a create or replace. What that means is it just replaces the entire data with the new inputs. So no matter how many times we run this code, the number of records in the output will always be the same.

For this exercise, I'd like you to go through this pipeline, which is fact order lines incremental, and determine if it's idempotent or not. If yes, why? If not, why not? I'll see you in five minutes.

Okay, welcome back. It is idempotent because we use overwrite partitions and partition by create and update. What this means is that no matter how many times we run with the same input time range, the output will remain the same. Because this is a fact table, we are running an incremental pipeline that depends on the time range and the input, and whether the input has the same data or new data, it'll just process that and overwrite the entire set of partitions. So there is no way you will have duplicates or partial data from older partitions of that data.
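
The partition overwrite idea can be sketched without any particular engine. In the sketch below a plain dict keyed by partition value stands in for a partitioned table; the `order_date` partition key, the column names, and the helper functions are all hypothetical and are not taken from the course repository. An append-based load is included only as a counter-example.

```python
from collections import defaultdict


def transform(order_lines):
    """Pure transform: derive line_total from the input rows only."""
    return [
        {**line, "line_total": line["quantity"] * line["unit_price"]}
        for line in order_lines
    ]


def load_overwrite_partitions(table, rows, partition_key="order_date"):
    """Replace every partition present in `rows`; other partitions stay untouched."""
    incoming = defaultdict(list)
    for row in rows:
        incoming[row[partition_key]].append(row)
    for partition, partition_rows in incoming.items():
        table[partition] = partition_rows  # swap the whole partition, never append
    return table


def load_append(table, rows, partition_key="order_date"):
    """Counter-example: appending adds the same rows again on every rerun."""
    for row in rows:
        table.setdefault(row[partition_key], []).append(row)
    return table


def row_count(table):
    return sum(len(rows) for rows in table.values())


# Hypothetical input for one time range.
batch = [
    {"order_id": 1, "order_date": "2024-01-01", "quantity": 2, "unit_price": 5.0},
    {"order_id": 2, "order_date": "2024-01-01", "quantity": 1, "unit_price": 3.0},
    {"order_id": 3, "order_date": "2024-01-02", "quantity": 4, "unit_price": 1.5},
]

overwritten, appended = {}, {}
for _ in range(3):  # simulate three reruns over the same time range
    load_overwrite_partitions(overwritten, transform(batch))
    load_append(appended, transform(batch))

print(row_count(overwritten))  # 3: reruns leave the output unchanged
print(row_count(appended))     # 9: every rerun duplicated the batch
```

The overwrite version stays at three rows however many times it runs, because each rerun replaces the partitions it touches in full.
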
So it's a total overwrite, and that's why this is an idempotent pipeline.

Now the final question is: is SCD2 with merge into idempotent? Let's look at that. In SCD2, the extract is standard and the transform is standard; however, the load is complex. It is not always idempotent, specifically because of these two conditions. What's happening here is that instead of just writing an output, it's updating existing data, and it's only creating the updates when the incoming data is newer than the existing data. So if you reprocess the same incoming data, it will not do anything because this condition will not be matched, meaning the updated date will be back in history

### Segment 2 (05:00 - 05:40) [5:00]

as opposed to the expected future history. SCD2 constructed using merge into is not idempotent, and that's why, while we were doing the rerun chapter, you might have noticed we had to do the cleanup first before we could rerun the SCD2 pipelines. So no, SCD2 is not idempotent, and that's why it's difficult to build and maintain. Typically, when there's an issue, you just delete it and recreate it. That is much easier than cleaning it up and recreating it, and that's why snapshot dimensions are almost always preferred. I'll see you in the next chapter.
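
The merge conditions themselves are not visible in the transcript, so the following is only a rough sketch of the behavior described: a hypothetical `scd2_merge` that closes the current row and adds a new version only when the incoming `updated_at` is newer. The column names and sample data are assumptions. It illustrates why the result depends on what is already in the table, and why a cleanup or full rebuild is needed before a rerun.

```python
from datetime import date


def scd2_merge(table, incoming):
    """Apply incoming rows to an SCD2 table (a list of dicts), in place."""
    for row in incoming:
        current = next(
            (
                r
                for r in table
                if r["customer_id"] == row["customer_id"] and r["valid_to"] is None
            ),
            None,
        )
        if current is None:
            # First time we see this key: open a new version.
            table.append({**row, "valid_from": row["updated_at"], "valid_to": None})
        elif row["updated_at"] > current["updated_at"]:
            # Incoming data is newer: close the current version, open a new one.
            current["valid_to"] = row["updated_at"]
            table.append({**row, "valid_from": row["updated_at"], "valid_to": None})
        # Otherwise the incoming row is not newer and is skipped.
    return table


# Hypothetical batches, for illustration only.
day1 = [{"customer_id": 1, "city": "Austin", "updated_at": date(2024, 1, 1)}]
day2 = [{"customer_id": 1, "city": "Boston", "updated_at": date(2024, 1, 2)}]
day1_fixed = [{"customer_id": 1, "city": "Dallas", "updated_at": date(2024, 1, 1)}]

table = scd2_merge(scd2_merge([], day1), day2)
before = [dict(r) for r in table]

# Rerunning day 1 (even with corrected data) is skipped: the table already
# holds a newer version, so the "newer than existing" condition never matches.
scd2_merge(table, day1_fixed)
assert table == before

# Cleaning up and rebuilding from scratch is what produces the intended history.
rebuilt = scd2_merge(scd2_merge([], day1_fixed), day2)
assert [r["city"] for r in rebuilt] == ["Dallas", "Boston"]
```

In this sketch the same day 1 input yields one row on an empty table and no change on a table that already contains day 2, so the output is not a function of the input alone. A snapshot dimension avoids this because each run fully replaces its output.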