# Materialized Lake Views in Microsoft Fabric: How They Actually Work

## Метаданные

- **Канал:** Havens Consulting
- **YouTube:** https://www.youtube.com/watch?v=pAxKtzuP9cQ
- **Дата:** 02.06.2026
- **Длительность:** 18:06
- **Просмотры:** 825
- **Источник:** https://ekstraktznaniy.ru/video/52941

## Описание

Every report hitting your raw tables runs the same expensive joins and aggregations over and over, even when the underlying data hasn't moved. Materialized Lake Views in Microsoft Fabric pre-compute those results once and store them as Delta tables in OneLake, so consumers read finished numbers instead of recalculating from scratch.

In this video I cover what MLVs actually are, how the automatic refresh logic decides between skip, incremental, and full rebuild, the Spark SQL syntax to create and manage them, and where they sit in a medallion architecture (think of them as a "Gold+" layer).

The part worth sticking around for: point Direct Lake at an MLV instead of your raw tables and you get pre-aggregated data at import speed. In the example here, a 50 million row fact table collapses to around 500K. Fewer rows means faster transcoding into VertiPaq and a lot less DirectQuery fallback risk. That's the single best reason to pair the two.

I also get into when NOT to use them, because 

## Транскрипт

### Segment 1 (00:00 - 05:00) []

Hey data fans, Reed here. Today, I want to talk about a new piece that can fit inside of the Fabric platform and your ecosystem and medallion architecture. It's a concept of something called materialized lake views. So, something that can kind of fit into place essentially as an aggregation table for data that is processed and summarized at a higher level than some of your base facts, but also might be getting queried very often and costing you a lot of compute usage or just cost in general in your Fabric capacity. So, I want to dig into that a little bit, what they are, when they're needed, how they're used, and a bit of the logistics around it. So, with that being said, let's hop into the interactive guide and get started. So, essentially, materialized lake views are precomputed performance in Fabric that can eliminate expensive recalculations, help to accelerate queries, and potentially simplify some of your semantic models. So, let's first start about and talk about the performance bottleneck and the idea around whenever report needs to recalculate the same expensive queries. So, you might have questions such as like, how many of the reports that you have that takes 30 plus seconds to load? Will often get some engagement and a pushback from ecosystems that you have. Now, the scenario can come about from often things like large joins as you can see in front of you here, group by aggregations that are in your DAX measures, window calcs, or distinct counts. Various reports and notebooks can often be asking and or looking for various cuts of data and query plans that are often at very similar granularities coming from tables via direct query, sorry, tables lake sitting in lake houses and warehouses. So, essentially trying to lay the groundwork for what the problem might be. And the cost from that comes with various CPU spikes, slow report loads, and or potentially direct query fall back. So, lots of problems that can come around because the data has not been precomputed and you're needing to pull at such a low level of granularity with such a large table that you end up with some of these or all of these various problems. Now, that's where the idea of materialized lake views comes into play. These are precomputed query results persisted as Delta parquet tables. Core concept being that a regular view generally has virtual to uh From a core concept perspective, regular views are virtual. They have no physical storage and they re-execute the underlying query on every access. Full compute cost each time, but they're always live, but you have to pay for that fresh data on every read. So, it's a saved query, not a saved result. Versus a materialized lake view, which persists as a Delta table in One Lake inside of the lakehouse, it reads stored results. It's minimal compute on query and it's auto maintained by Fabric. So, it's both uh so, in this case, it is a saved result, not a saved query. So, it's actually cached data. Now, by definition, as I mentioned, it's a precomputed query result defined by Spark SQL that is automatically persisted as a Delta table in One Lake. And unlike regular views that are re-executed on every query, materialized lake views store the computer results and are automatically maintained. So, I want to emphasize that it's precomputed and it's Delta in format and it is auto maintained. And the cool thing about this is that it leverages Spark SQL to write it. So, you can write a SQL statement that, as we'll see, gets transformed into what's needed by the Fabric engine inside of Fabric. So, key properties with this. As I mentioned, precomputed, Delta stored, auto maintained, and Spark SQL defined. So, this is a big one. It is authored in a Lakehouse notebook that can leverage create materialized Lake View function to basically turn your SQL query into something in the back end that gets stored in Delta Parquet format. Now, how do they work? From the Spark SQL definition to automatic maintenance. So, I want to walk through some of this. In this case, you write a Spark SQL query defining what the materialized Lake View should contain. So, Spark SQL is SQL dialect that runs on Apache Spark. And materialized Lake Views are authored in a Lakehouse notebook cell, saved as Delta Parquet. So, you write SQL, the engine will evaluate it, determine what is necessary from joins, aggregations, or filters, and produce the Delta table. So, there's a bit of translation magic sauce that happens where you can write known SQL and it will convert this for you leveraging the Spark engine. Now, storage and access. Scrolling down a little bit, the materialized Lake Views Delta table lives in OneLake alongside of your source tables where any Fabric engine can read it. So

### Segment 2 (05:00 - 10:00) [5:00]

leveraging Spark, SQL, or Power BI, all of those can retrieve the data from here. It is just a standard Delta table by the time it is materialized for any downstream items or dependencies. Also, with the idea around automatic maintenance. So, on each scheduled refresh, Fabric's optimal engine compares Delta logs and picks whether or not it skips it, incremental load, or full automatic load. No manual orchestration is needed for this. So, the refresh frequency is set in the Lakehouse and the materialized Lake Views, and then the schedule is pain. So, again, kind of once set and done for the most part, which compared to manual egg tables that need pipeline orchestration if you were to build these yourself. So, that's where some of the automatic maintenance of this comes into play. And as they mentioned here, uh one prerequisite is the change data feed. So, incremental refresh does require the delta enabled change data feed equals true. So, this is important because when you're working with materialized lake views because without this, you're stuck between choosing a skip or a full rebuild. So, it's easy to forget, but easy to debug once you know it. Now, also transparency on query routing. This is the magic moment. The query routing is transparent. Your existing reports get faster without changing any code. So, the query optimizer knows about your materialized lake views. When a query matches a materialized lake view definition, the engine can transparently route through the precomputed results instead of recalculating from raw tables. So, this means that when you use one of these, you don't have to write separate measures in your DAX queries or anything else to leverage this. This automatically grabs this. So, think of this like the Power BI egg tables, aggregation tables back in the day, but without having to set up any definitions, without having to do the mapping yourself. The engine knows when it can use this. If you are already doing direct lake mode to a lakehouse or any semantic layer on top of that, and you add this to that item, this will automatically get used if and when it's necessary for faster performance. That's where a lot of this magic happens with materialized lake views. It's very cool in that regard. I want to also just show a bit about what some of the Spark SQL syntax could look like. So, it is in many ways essentially an inheritance of standard SQL, but meant to run through the Spark engine. But, this is something where it does not require any new language to learn. This is just one of many patterns that can come through this, but just gives you a rough definition of how this could look if you were to going to be using the Spark SQL editor inside of the Fabric environment. Now, similarly, there as a quick note, there is also options for altering and dropping similar to just standard SQL development. But, these can be done in separate operations as necessary when you're making alterations or changes. And as well, scrolling down, we do have options for metadata. So, some of the queries that we can do if we need to pull in information, we have the ability to retrieve a lot of the statistics and metadata around any of the information from the materialized lake views. Now, questions for a trio of patterns on when to use materialized lake views. So, ideal patterns of when and maybe when to avoid them. Pattern number one, heavy aggregation. So, sums, averages, counts, etc. Classic materialized lake views use cases where you can precompute once and retrieve this data many times. So, basically brought into a higher level. Again, as a comparison from the old school Power BI aggregation tables, very similar parallels to this, but now in Fabric. Complex multi-table joins. It can help to flatten fact dimension joins into a single pre-joined table that is instantly queryable. If you have certain common join patterns that you're trying to accommodate. And frequently queried subsets. So, filter data sets. Examples, last two years, top regions. So, any common patterns that you're noticing maybe from some audit analysis of common queries being ran in your tenant, those can be potential benefits to turn into materialized lake views as well. Now, when not to use. Couple of things to think about here. So, sub-minute volatile data. If your data's changing every few seconds, very fast occurrence of updates with data going into there, one, you might want to consider why are you using a lakehouse versus a SQL database or other type of Postgres for quick data being entered. But those are scenarios where you may not want to be using materialized lake views versus just direct query scenarios or something else. Another one is small simple tables. Tables that are already small enough to query directly, the overhead of maintaining a materialized lake view might not be worth it. And uh you want to consider row level security uh needs. Materialized lake views do not natively support RLS. Per user filtering still needs to happen downstream, especially if you've filtered and aggregated up to a level past where natural row level security can filter by. So, something also to consider. I like to have a decision framework as well of some yeses and nos. So, questions you can go through. Is the query expensive?

### Segment 3 (10:00 - 15:00) [10:00]

Nope. Then no materialized lake view is needed. Question number two, is it queried frequently? The answer is also no. And the cost is acceptable, you don't need a materialized lake view. Third question, source data table without uh sub minute changes? If that is not the case, then no, you can use direct query for those fast occurring tables with frequent updates. Otherwise, if any of those three are met, that's when you can move down to uh potentially using a materialized lake view. So, they are useful, but there is a specific scenario in combination of that needs to justify it to add that technical debt and add that complexity. Now, materialized lake views versus direct lake, things to think about. So, the two of them together allows you to be able to direct lake to materialized lake views to get similar import like speed against that pre-aggregated table. Because they are stored as Delta tables in a lakehouse, they will appear in the SQL endpoint and for semantic model designers like any other table. You can just pick the materialized lake view in building or editing the model if you want to access it directly rather than letting the engine do it automatically. So, that's the great part about the VertiPaq engine is that it can either automagic itself into grabbing it as necessary when it finds the right query pattern if you are leveraging it that way, or you can reference it directly if you want to point it to and add it to a model design. Scrolling down a little bit on the VertiPaq um advantage, fewer rows equals faster trans-coding into VertiPaq. So, the aggregation ratio will depend on your data, uh but often between a 10 to 100x speed reduction could occur in certain demo environments. Again, real world is going to fluctuate a lot, but there is many scenarios where you can easily go from a number of rows from 50 million down to about 500,000 or less. So, just like the conversation with aggregation tables historically back in the day with Power BI, these follow a similar pattern for those. Fewer rows, faster trans-coding, less memory, and reduced risk of fall back to direct query versus direct lake if you're using those scenarios. So, that's the number one reason to pair materialized lake views with direct lake is to avoid that fall back. Now, a full architecture pattern, uh materialized lake views can sit kind of like a gold plus layer potentially in uh your medallion architecture somewhere between gold and the semantic layer. So, it's the processed data, but then pre-computed to an even higher level. So, I think gold plus is kind of a nice way to think about this if you were to think about it in that medallion structure. And as a brief note, we do have the direct lake and medallion architecture guides that are available, so feel free to check either of those out if you want to explore some of the other interactive guides that I've written. Now, to summarize it all with some best practices and limitations. Start with the slowest queries. Identify your most expensive SQL operations first. Those will deliver the biggest ROIs as MLVs, and I'm just going to refer to it as that cuz it's hard to say materialized lake view repeatedly. Monitor cost versus savings. They do consume storage and refresh compute, so just be aware of is it going to be more or less expensive overall from a compute usage on your fabric capacity. Keep your definitions focused. Narrow purpose-built MLVs will outperform wide kitchen sink views combined with partitioning. So, partition source tables before creating materialized lake views, incremental refresh will become even faster for them. Use good naming conventions, something like a prefix with MLV underscore to help distinguish them. And test with real workloads, possibly in dev or test environments before rolling this out to production. Some cost considerations. So, they do trade storage and refresh compute for query time savings. So, it's good to understand the billing model to help you organize the right approach for this. Refresh compute, so it does run on Apache Spark. So, that will use compute units and it will classify as a background operation with smoothing. So, things to consider. The OneLake storage, they do persist as Delta tables. So, storage will be built separately per gigabyte as whatever your cost is related to that. And generally speaking, that refresh strategy, as I mentioned, it's set to auto refresh and auto selects skip, incremental, or full rebuild. That's just built into it. So, that automatic is something to monitor and see how that is performing. And a key insight, again, aggregation-heavy MLVs will always use a full refresh because incremental mode only supports append-only operation. So, factor this in your capacity planning for larger aggregation tables. Current limitations. So, CDF is required for incremental. Again, incremental refresh only works when the source tables have Delta enabled change data feed equals true. Otherwise, a refresh will be a full rebuild. Spark SQL authoring only. So, authoring in a lakehouse notebook via Spark. PySpark authoring is in preview, but only supports full refresh. Storage cost. So, each MLV is a physical Delta table. So, you're trading storage for compute savings. So, just monitor your OneLake usage. Refresh latency. So, automatic maintenance is not instant. There is a

### Segment 4 (15:00 - 18:00) [15:00]

latency window between source changes and the materialized lake view update. I forget the exact number. I know it can be multiple minutes. Sometimes, I think 30 to 60 minutes last I checked. Don't quote me on that, but there is a noticeable latency delay. So, just be aware of that. Again, which is why they're not very good with real-time data because you'll often have a bit of a delay between the two of them. So they're they work much better with data that is um a couple of times a day, daily, or weekly. Alter is rename only, so changing the select constraints or partitioning means re-running create or replace. Alter itself only supports rename too. And last but not least, single lakehouse scope, cross lakehouse lineage, and execution are currently not supported. So all source tables and dependencies for materialized lake views must live in the same lakehouse. And they are an evolving feature. They're actively expanding Spark SQL coverage, PySpark authoring, which is currently in preview, and maintenance capabilities. So please encourage you to check the latest documentation for current limitations on this. Couple of final comparison tables for takeaways. So materialized lake views versus other pre-computation strategies. So regular views, no storage, no maintenance, but every query re-executes the full compute. You get the real-time freshness, which is about the only one, but everything else loses that scale for that for a standard view. Materialized lake view, that's the teal column here. It's Delta stored, minimal compute on query, optimal refresh, automatic maintenance, direct lake compatible, best for heavy repeated queries, and in Fabric, they give you the best of both worlds. Looking back a little bit more for like legacy items, uh a manual aggregation table that you've built yourself in some capacity through a data pipeline or something, same Delta storage, minimal query compute as a materialized lake view, but you own the refresh pipeline. But that adds higher complexity, manual maintenance. So pick this when you need custom ETL control that the materialized lake view will not give you. And last but not least, the legacy item, but Power BI aggregation tables, imported into the model, has dataset refreshes and manual config, but direct lake only works in dual mode for these. Um they're still valid for existing Power BI aggregation deployments and for new builds, um but generally uh, for new builds, sorry, the lake views are usually the better answer. But overall, I'm hoping this gives you a nice introduction to the concept of what a materialized lake view is and something that you can add inside of Fabric. Uh, with a lot of these guides, I'm trying to aim for concept rather than do. Just with how quickly UIs and functionality is changing these days, I feel like guides are a better use to provide information to the community on things to use to then go explore themselves rather than just me clicking around in a demo. And it's a better resource to be able to graphically and holistically see some concepts. But if you like this, drop some comments down below in the comment section. Check out some of our related guides in the related sections. Videos in the upper left are related videos and as always liking, commenting, and subscribing will help the channel grow. With that being said, I will see you in my next video.
