Interoperability of Delta Lake table format in Fabric


Contents (4 segments)

Segment 1 (00:00 - 05:00)

Hey everyone, welcome to our channel dedicated to Microsoft Fabric. This is a series dedicated to data engineering, data science, data integration, and doing everything within Fabric with data in the Delta Lake format. Today we have a special guest, Daniel. Thanks for joining us; I think this is the fourth time you are joining the series.

Yes, thank you for having me, it's a pleasure to be here.

That's great to have you here, especially since there are lots of announcements related to different types of catalogs, tables, and data formats; lots of things are happening from different sources, internal and external. I would love to start the discussion by asking the first question: what has changed recently on our side, and what are you working on right now?

It's a lot of work, I would say. My main charter right now is to take care of Delta Lake across all of Microsoft Fabric. Delta Lake, as we have discussed in at least three episodes in the past, is the unifying table format for Microsoft Fabric. That means that across all the workloads and all the experiences, you have a Delta table that is written once by any of the engines and can be read by all the other engines. That's how we deliver the one-copy promise of Microsoft Fabric: you have one Delta table in OneLake, and all the engines, workloads, and experiences are driven from that same Delta table.

So what I hear is that even though we have many different engines, each with a different implementation, any of them can write to a Lakehouse, and at the heart of the Lakehouse is a Delta Lake table. So I can use any engine to save the data and read the data, is that correct?

In any combination that you want.

Amazing. That sounds challenging, knowing that those engines are super different. We had an episode comparing the Lakehouse versus the Data Warehouse, not only from the storage perspective but from the querying-engine perspective. So could you summarize the main patterns our customers follow when working with Fabric and with Delta Lake tables?

Of course. When we look at Delta in the Microsoft Fabric context, we have three main patterns. The first is ingest from any source using the Microsoft Fabric engines. A second very strong pattern that a lot of customers use is simply integrating with whatever they already have, not moving data if they don't need to. And third, when you have a data warehouse or operational stores feeding Power BI analytics, that's where we want to be: we just attach to their warehouse and start reporting right away.

On the first pattern, if you're already using Microsoft technology to ingest into Fabric, such as Dataflows, Data Pipelines, the Lakehouse "load to tables" capability, or mirroring, everything already lands in the Delta Lake format. So right after ingest it's ready to go: you can start using all the other engines to query your tables, change your data, and build reports right away.

If you're integrating with external resources, the primary way is shortcuts. Imagine you already have Delta Lake tables available elsewhere: just shortcut them into the Lakehouse, at any medallion layer of course, and start building your reports right away. If a table is not Delta, you can also use our experiences to convert it to Delta in place, or use a data copy operation with Spark or a data pipeline to convert it to a Delta table.

And finally, the Data Warehouse is already generating Delta tables for you: just mount them into the Lakehouse, and you can start leveraging the Direct Lake technology right away, which uses the V-Ordering technology to achieve spectacular performance. And once it's all there, you already have your Delta tables coming from, or integrated from, any source.
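The "written once by any engine, read by all the others" idea works because a Delta table is just Parquet data files plus a `_delta_log` folder of ordered JSON commits that every engine replays the same way. Below is a minimal, illustrative sketch of that log replay using only the Python standard library; real engines use a full Delta Lake implementation (such as delta-spark or delta-rs), and the action fields here are heavily simplified:

```python
import json
import os
import tempfile

def write_commit(table_path, version, actions):
    """Append one commit to the table's transaction log (simplified)."""
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    # Commit files are zero-padded version numbers, e.g. 00000000000000000000.json
    commit_file = os.path.join(log_dir, f"{version:020d}.json")
    with open(commit_file, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def active_files(table_path):
    """Replay the log in order: 'add' registers a data file, 'remove' drops it."""
    log_dir = os.path.join(table_path, "_delta_log")
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

# "Engine A" commits two Parquet files; "engine B" later replaces one of them.
table = tempfile.mkdtemp()
write_commit(table, 0, [
    {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
    {"add": {"path": "part-0000.parquet", "dataChange": True}},
    {"add": {"path": "part-0001.parquet", "dataChange": True}},
])
write_commit(table, 1, [
    {"remove": {"path": "part-0001.parquet", "dataChange": True}},
    {"add": {"path": "part-0002.parquet", "dataChange": True}},
])

# Any reader replaying the same log sees the same table state.
print(sorted(active_files(table)))  # ['part-0000.parquet', 'part-0002.parquet']
```

Because the table state is derived purely by replaying the shared log, any engine that follows the protocol sees exactly the same set of active files, which is what makes the one-copy promise possible.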

Segment 2 (05:00 - 10:00)

You can also apply transformations using Spark or data pipelines, and then start leveraging the SQL analytics endpoint and notebooks; everything is ready for you to consume using any other engine. Some people ask, "Okay, but which engine should I use?" The engine that you're most skilled with. Maybe you're very proficient with T-SQL, or very proficient with Power BI: leverage that technology, leverage your knowledge, attach to the Delta table, and start rocking away.

It sounds like all of the engines that are first-class citizens in Microsoft Fabric contribute to one data platform where the customer can choose whatever they want: pipelines in Data Factory, with their different ways of processing data such as dataflows; data engineering; data science; the Data Warehouse; Real-Time Analytics, which was announced a few weeks ago; and of course the heart, which is Power BI. What are the principles you follow when introducing and working on the heart of the Lakehouse, which is Delta, and what are the challenges?

Oh, that slide, I was looking into it, okay. So the main principles: the main work we do is focused on making every engine in Fabric, what we call Fabric-native engines, Delta-enabled. The strongest principle we go by every day, in everything we do, is that if one compute engine or workload writes a Delta table, every other engine or workload needs to be able to read it. That already gives us a matrix, and a group of challenges we need to work through, and we are very transparent about what's working, what's not working, and which Delta Lake features are available; we go for maximum compatibility. Our second strongest principle is that we align to the Delta Lake open-source protocol specification. We contribute back directly; we have made many contributions back to the Delta Lake library, just as we do for Apache Spark, Apache Parquet, and other projects out there.

As of today, my main challenge in my daily work is that we have 12 engines inside Microsoft Fabric that we're pushing to be 100% protocol-compliant and Fabric-native for Delta. That means there is a huge committee that we drive: we have weekly meetings, semester plans, and very strict goals for making sure that every engine is protocol-compliant, that all the Delta types behave, that everyone can read and everyone can write, and then we move the bar as a whole as features land. We also need to go beyond, because the reference Delta Lake implementation out there is pretty much Scala and Java. There is a newer Delta Rust library, but it is still not in a production-grade state, so all we have is Scala and Java. And our engines are not built only in Scala and Java; Spark is, but most of our engines are either C++ or C. So we go beyond and maintain internal implementations as well, which means we need to catch up quickly, bring all the features to the C++ and C implementations, and move the needle as a group.

How do you track it? Is there anything that I, as a customer and user of Fabric, should read or take a look at?

Yes. People should track it, and also provide feedback and get in touch with us if we're dropping the ball or if we're late. But even before I go there, it's important for people to grasp that the Delta library version doesn't mean much. The capabilities of a Delta table are driven by properties within the table itself. There are two main parameters, minReaderVersion and minWriterVersion, that control which features can be used in that Delta table. The point being: do not over-index on the library version.
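The minReaderVersion and minWriterVersion that Daniel refers to live in the `protocol` action inside the table's `_delta_log`, not in any client library. As a rough, stdlib-only sketch of where that information sits (the helper name `table_protocol` is invented for illustration; real clients surface this through APIs such as Spark SQL's `DESCRIBE DETAIL`, and the commit below is hand-written sample data):

```python
import json
import os
import tempfile

def table_protocol(table_path):
    """Scan the _delta_log commits in order and return the latest 'protocol'
    action, which carries minReaderVersion / minWriterVersion (and, on
    protocol 3/7 tables, explicit readerFeatures / writerFeatures lists)."""
    log_dir = os.path.join(table_path, "_delta_log")
    protocol = None
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "protocol" in action:
                    protocol = action["protocol"]  # later commits win
    return protocol

# Demo: a hand-written commit for a table that enables deletion vectors.
table = tempfile.mkdtemp()
os.makedirs(os.path.join(table, "_delta_log"))
with open(os.path.join(table, "_delta_log", f"{0:020d}.json"), "w") as f:
    f.write(json.dumps({"protocol": {
        "minReaderVersion": 3, "minWriterVersion": 7,
        "readerFeatures": ["deletionVectors"],
        "writerFeatures": ["deletionVectors"]}}) + "\n")

p = table_protocol(table)
print(p["minReaderVersion"], p["minWriterVersion"])  # 3 7
```

This is why the library version "doesn't mean much": an engine must satisfy the table's own protocol requirements before it is allowed to read or write it, regardless of which library it ships.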

Segment 3 (10:00 - 15:00)

Focus on what the table version is and which features are enabled on that table.

Okay, but how do we keep track? How do I know what I can or cannot enable on a table, and whether it is going to work in Fabric? As I mentioned before, we're very transparent in our documentation. We have a Fabric Delta interoperability documentation page that we keep updated at every milestone, and it tracks all the engines we have, all the workloads, all the Delta table features, and the minReaderVersion and minWriterVersion enabled by default in each engine. So if you want to know "Can I use deletion vectors?", "Can I use column mapping?", or "Is liquid clustering supported?", you should go and check that documentation. And if something is missing, there is a feedback button: let us know, and we're going to come clean on that information.

Daniel, that's huge and challenging. I'm super happy that we are pushing this forward in a way that gives customers a predictable, smooth experience across engines. Can you share, from your side, the future you are envisioning and building for Delta Lake as the heart of the Lakehouse?

Yeah. It all comes under the principle that we always focus on maximum compatibility: one engine writes, all the others need to be able to read. That's our driving principle every day. With that, we need to bring new Delta features in. As of June 2024, when we're recording this, Delta 4.0 is about to be released, and there's a lot of movement in the community around table catalogs and newer features for Delta. I'll give you some examples of things we're already working on. We have deeper technical ones: we're standardizing the encoding of data types and optimizing the row groups, which are fundamental pieces for performance, and we're also standardizing on checkpoint V2, which is a specific piece of the protocol implementation. Those things happen on the back end; they are things you shouldn't have to think about, but we do, because we need to align across multiple engines. We're also aligning on the newer capabilities of Delta 3.2 and 4.0, how we're going to incorporate them, and, most importantly, moving the bar on all engines to accept those new features.

We also have big customer-facing items that we're going to ship over the next semester. For example, we're going to enable column mapping, what some people would call "pretty column names", across all the Fabric experiences. So those are some of the key things coming up. Part of my job is to make it smooth, so that customers understand that in Fabric all the Delta experiences are aligned, and whatever feature we bring in is going to arrive with maximum compatibility and no disruption, in a safe way. I think that's the most important message here.

You said column mapping is coming, and deletion vectors as well, so pretty fancy features for Delta Lake are coming. Meaning that, as of now, we ship the library but the customer has to configure the minReaderVersion and minWriterVersion themselves; what's coming is that we'll enable those by default, and not only for Spark or only for the DW, for example, right?

It's all about moving the bar as a group, and really making a feature available not on an engine-per-engine basis but as a unified experience across Fabric.

Yeah, that makes sense; as of now customers can use these features, but only in a few selected engines. That sounds challenging; at the same time, I know customers are waiting for it, so thanks for pushing that.

Good. For all who are watching us: please remember to leave a comment. In a comment you can type your question, or maybe the name of an episode you would love to see from me, from Daniel, or from other PMs; we are here to deliver content about the functionalities we are building for you. So remember to leave a comment, hit the like button, and subscribe.

Segment 4 (15:00 - 15:00)

Share the video with your colleagues as well. Also remember about the website ideas.fabric.microsoft.com: if you have an idea for how we should improve our product or the features within it, let us know; we review those on a daily basis. So until next time, happy exploring the Delta Lake features across all the engines! Thanks a lot. Thanks, Daniel. Thank you.
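As a closing illustration of the column mapping ("pretty column names") feature discussed in the episode: when column mapping is enabled (table property `delta.columnMapping.mode` set to `name`), each field in the table's schema carries a stable physical column name used in the underlying Parquet files, so the user-facing name can be renamed freely without rewriting data. A simplified sketch; the schema below is hand-written sample data, not output from a real table:

```python
import json

# A simplified Delta schemaString for a table with column mapping enabled:
# each field's metadata stores the stable physical name used in Parquet.
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "Customer Name", "type": "string", "nullable": True,
         "metadata": {"delta.columnMapping.id": 1,
                      "delta.columnMapping.physicalName": "col-a7f3"}},
        {"name": "Order Total", "type": "double", "nullable": True,
         "metadata": {"delta.columnMapping.id": 2,
                      "delta.columnMapping.physicalName": "col-91bc"}},
    ],
})

def logical_to_physical(schema_string):
    """Map the user-facing column names to the physical Parquet column names."""
    schema = json.loads(schema_string)
    return {f["name"]: f["metadata"]["delta.columnMapping.physicalName"]
            for f in schema["fields"]}

print(logical_to_physical(schema_string))
# {'Customer Name': 'col-a7f3', 'Order Total': 'col-91bc'}
```

Renaming "Customer Name" only changes the logical name in the schema; readers keep resolving data through the unchanged physical name, which is why every Fabric engine has to understand this mapping before the feature can be enabled by default.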
