# Do You Still Need Surrogate Keys? (Modern Data Warehouse Design Explained)

## Метаданные

- **Канал:** Guy in a Cube
- **YouTube:** https://www.youtube.com/watch?v=51He_8RBcyk
- **Дата:** 04.06.2026
- **Длительность:** 6:35
- **Просмотры:** 7,648
- **Источник:** https://ekstraktznaniy.ru/video/52919

## Описание

Are surrogate keys still necessary in modern data warehouses and semantic models?

In this video, we want to walk through one of the most debated topics in data modeling: Natural Keys vs Surrogate Keys.

Patrick is going to show you exactly where natural keys work perfectly fine, where they start to break down, and why surrogate keys still matter in modern analytics solutions built with Microsoft Fabric and Power BI.

Using a hotel warehouse demo, this video walks through:

* Natural keys vs surrogate keys
* Why duplicate business identities create modeling problems
* How semantic model relationships behave differently
* Why surrogate keys preserve historical accuracy
* How fact tables align to dimension versions
* Real-world dimensional modeling concepts in Microsoft Fabric

One of the biggest misconceptions in data modeling is treating surrogate keys as a universal best practice.

This video shows why surrogate keys are not about adding another integer column to a model. They exist t

## Транскрипт

### Segment 1 (00:00 - 05:00) []

Yooo, what is up everyone? I got a question for you. Do you still need surrogate keys in a modern data warehouse? People treat surrogate keys like a best practice. There are scenarios where surrogate keys are absolutely critical, and in some cases they're not really a best practice. They're a response to a very, very specific problem. So in this video, I wanna show you exactly where a natural key looks fine, and then where it starts to fall apart. So you know how we like to do. Enough of all this talk. Let's do what? Let's head over to my laptop. All right, I've got a simple report here. Right now, there's no demo data loaded into the report. I've got three visuals. On the left, I'm gonna show survey counts using guest ID as a relationship key. This is the natural key or the business key. It's the identity the business recognizes. This typically comes from the source system. On the right, I'm going to show the same survey counts using guest key. That's the surrogate key. It's the key generated by the warehouse. And at the bottom, I'm gonna show you the actual fact table data so we can see what's really happening in the warehouse at the fact level. Now, let's add some data. So I have some scripts that I already wrote. So let's run this one. Let's add the first set of data. You can see it added a new guest with the loyalty tier of none. And then if I toggle through the different data sets, this is some bookings, and this one is some surveys. This is what we'll focus on. All right? Once the first set of data is added, the report looks clean. It looks really clean. The natural key version shows two surveys, which is correct. The surrogate And you can see that the loyalty tier is none, because this is a new guest, they don't have any loyalty. And if we go down to the fact table, the actual data in the fact table, you'll see two survey records. And so both of these are absolutely correct. So at this point, you might be thinking, "Why do I need the surrogate key? " And honestly, with the data as clean as it is, with a single version of the guest, that's a fair question. But let's take a look at the model and I'll dissect this a little bit for you. So here I have the dim guest in the middle, and then the two fact tables contain the exact same data. The difference is how they relate to the dimension. So if you look here on this side, you can see that they're related on the guest key, and it's a one-to-many relationship. And then over on this side, you'll see that we're related on the guest ID. That's the natural key relationship, and you may notice some things strange about this. It's a many-to-many relationship. It's a limited relationship. And one more thing I wanna point out before we go back to the report is the measures are simple. This is the measure that's just counting the rows on the fact table that's related on the natural key, the guest ID, the one the business knows. And then this is the DAX that is counting the rows in the table that's related on the surrogate key or the key that's generated by the data warehouse, right? That's it. Nothing fancy. I didn't do anything fancy with this DAX. The only real difference is the relationship key, and that's important because early on, both approaches appear to work. And if every guest has one row in a dimension, it works perfectly fine But everybody knows there's more data, and if we add more data, things change. So let me show you what happens when we add more data. I'm gonna go back over to my warehouse. I'm gonna add some more data. So we'll run this. And so what this is doing is this added another version of the guest, and so now the loyalty tier went from none to silver. And then if I go, you can see I actually have four surveys. But notice this, take a look. The guest key is different, but the guest ID is the same. And so what effect does that have on the report? So now look what happened. The guest still has the same guest ID, like I said before. From the business perspective, this is still the same guest, but now the guest has a new version in the warehouse. Different guest key, different loyalty tier. tier, right? Regardless of the relationship, same business identity, but a different warehouse version. Now look at the report. On the left, where I'm using the natural key relationship, the survey count is four, but that's wrong. The fact table only has two survey records per guest agent. So why is it showing four? Well, that's simple. Because the natural key is no longer unique in the dimension. The same guest ID now exists on multiple rows, and that's why we're getting the wrong answer. So when the fact table joins to that dimension using the guest ID, those fact rows can match more than one dimension row, and that causes the duplicate. The model is no longer pointing to a specific version of the guest. Now look at the right side. Using the surrogate key relationship, the count is correct because the guest key identifies the exact version of the guest, not just the guest as the business sees them, but the specific guest record the warehouse needs for historical accuracy.

### Segment 2 (05:00 - 06:00) [5:00]

And this is the whole point. Surrogate keys are not there because Power BI needs another integer column. They're there because the warehouse needs a stable way to identify a specific version of a business entity. When there is only one version of the guest, guest ID feels fine, it's clean. But the moment The same business identity can have multiple versions. Like in our case, the person went from no loyalty status to having silver status. The natural key is no longer enough. That's when the guest key really matters. That's when that surrogate key really becomes a significant part of the warehouse because the fact table can point to the exact version of the guest that existed when the fact was created. So do you still need surrogate keys? If your natural keys are stable, unique, and you're not preserving history, you may not need them. But if you need to track change over time, merge systems, preserve historical attributes, or avoid duplicate matches in your model, then the surrogate keys are absolutely still relevant. And that's the real takeaway. Don't use surrogate keys just because someone told you it's a best practice. I used to do that. Use them when the model has a responsibility that the natural key cannot handle. Surrogate keys are not about adding complexity. They're about preserving meaning, and in the warehouse, the meaning matters. So if you have any questions, comments, you know what to do. Post them in the comments below. If you wanna learn more about Power BI, Fabric, architectural design, go take a look at that video. And as always, thanks for watching, and we'll see you in the Cube.
