Next Big Thing for Data Engineers - Open Table Formats 🚀 (Apache Iceberg, Hudi, Delta Tables)


Table of contents (3 segments)

Segment 1 (00:00 - 05:00)

Have you ever wondered how big companies like Netflix, Uber, or even your favorite e-commerce website keep track of the ocean of data flowing through their systems? How do they manage updates and handle changes across petabytes of data without missing a beat? The answer might lie in a concept that sounds simple: the table format. Not just any table format, though. Today we are going to talk about open table formats, how they are changing the entire landscape of data engineering, and why they might be the next big thing you should focus on. Stick around, because by the end of this video you will have a complete understanding of what open table formats are, some of the different frameworks available such as Apache Iceberg, Delta Lake, and Apache Hudi, how they are transforming the data industry, and how you can take advantage of them.

Before we start, we need to understand the history: why did we even need all of this in the first place? Once you understand the history, everything becomes clear, so we will focus on the core foundations first. The world of data engineering is moving fast. Every day we generate roughly 2.5 quintillion bytes of data, and companies in every sector, like finance, healthcare, e-commerce, and streaming, are looking for ways to efficiently store, query, and analyze it.

Back in the day, Hadoop was the center of big data processing. It has HDFS, the Hadoop storage system, and MapReduce, a data processing framework that splits big data into chunks, processes them, and combines the output at the end. While Hadoop worked, it had drawbacks: managing a Hadoop cluster was very difficult, the ecosystem had a very steep learning curve (if you wanted to use Hadoop, you had to hire experts), and it did not scale with the data we were getting.

We also used to store our data in data warehouses. We would extract data, do some transformation using Hadoop, and then load the data into the warehouse. A data warehouse requires you to store your data in a structured format only; you cannot store random unstructured or semi-structured data, since warehouses relied entirely on structured data. So you extract data from multiple sources, transform it, and only then load it into the warehouse. This works, but it has its own challenges: if you want to modify something, say add a new column, remove one, or change a data type, you have to make a lot of changes. First you change the table, then the scripts, and then you backfill all of the previous data according to the changes you made. This kept happening, and the challenges kept growing one after another.

To solve this, a new concept, the data lake, came along as a flexible, scalable solution with a simple premise: just dump all of your data into a cheap storage system like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage, and worry about processing it later.
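That "dump now, process later" premise can be sketched with plain files. Below is a minimal illustration, with the local filesystem standing in for S3; the `dump_events` helper and the `dt=...` path layout are invented for the example:

```python
import json
import os
import tempfile
from datetime import date

def dump_events(events, root, event_date):
    """Write a batch of raw events to a date-partitioned path,
    mimicking an object-store key layout like events/dt=YYYY-MM-DD/."""
    part_dir = os.path.join(root, f"dt={event_date.isoformat()}")
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, "part-0000.json")
    with open(path, "w") as f:
        for e in events:
            # Newline-delimited JSON: no schema enforced at write time,
            # which is exactly the "schema on read" trade-off of a data lake.
            f.write(json.dumps(e) + "\n")
    return path

root = tempfile.mkdtemp()
p = dump_events([{"user": 1, "action": "click"}], root, date(2024, 1, 1))
print("wrote", p)
```

Note that nothing here validates the data or coordinates concurrent writers; that simplicity is what made data lakes cheap, and also what caused the problems discussed next.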
As data lakes grew in popularity, they brought their own challenges. One of the biggest was the lack of ACID transactions. In databases, we know the ACID properties: atomicity, consistency, isolation, and durability. Atomicity means every transaction either completes fully or fails; consistency means that after a query runs, the database moves from one valid state to another; isolation means that if two people query the database at the same time, they should not affect each other; and durability means that in case of a failure, you can recover your data from the last checkpoint. These guarantees exist at the database level, but they were not available on the data lake, because a data lake is just files stored in object storage.

On top of this, data lakes faced many other issues. Schema evolution: data structures change over time; new columns get added, things get removed, new tables appear, and data lakes could not adapt to this, because again they are just files in object storage. Performance bottlenecks: as companies realized they needed frequent, near-real-time queries, it became clear that just dumping data into object storage is not the way to go. You need advanced indexing, partitioning, compaction, and metadata handling in order to query your data fast; you cannot keep scanning at the file level, you need something that can jump directly to the right data and fetch it. And version control and time travel: if you want to roll back previous changes or recover data from a point in time, you cannot, because once more these are just files in object storage.

So what we really needed was something with the capabilities of a database, like ACID properties, and the flexibility of a data lake, where you can just read the data as your requirements demand. Enter the concept of the open table format.
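The missing atomicity described above is typically recovered on object storage by swapping a single metadata pointer, which is the core trick the table formats below all rely on. Here is a minimal sketch of the idea; the `commit` and `read_current` helpers and the file names are hypothetical, not any real format's API:

```python
import json
import os
import tempfile

def commit(table_dir, data_filename, version):
    """'Commit' new data atomically: write the table metadata to a temp
    file, then rename it over the current pointer in one step. Readers
    see either the old version or the new one, never a half-written state."""
    meta = {"version": version, "files": [data_filename]}
    tmp = os.path.join(table_dir, f"_tmp.{version}.json")
    with open(tmp, "w") as f:
        json.dump(meta, f)
    # os.replace is atomic on POSIX filesystems; real formats use an
    # equivalent atomic swap in the catalog or object store.
    os.replace(tmp, os.path.join(table_dir, "current.json"))

def read_current(table_dir):
    """Readers always go through the single pointer file."""
    with open(os.path.join(table_dir, "current.json")) as f:
        return json.load(f)

d = tempfile.mkdtemp()
commit(d, "part-0001.parquet", 1)
commit(d, "part-0002.parquet", 2)
print(read_current(d)["version"])  # 2
```

The data files themselves are written first and never modified; only the final pointer swap makes them visible, which is what gives writers an all-or-nothing commit.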

Segment 2 (05:00 - 10:00)

An open table format is the evolution of raw data dumps into structured, high-performance, transactionally consistent tables on top of data lake storage. That is just a fancy definition; all it means is that you get database-like features combined with data lake flexibility. We'll explore Apache Iceberg, Apache Hudi, and Delta Lake in the following part of the video, but first let's spend some time on why open table formats exist in the first place: what do they provide, and why should we even use them?

First, they provide ACID guarantees. We already know why the ACID properties are important. Traditional data lakes did not provide them, so if one job is writing while another is reading, you may not have data consistency. This problem is solved by the new open table formats such as Iceberg and Hudi through what we call a transactional layer; we'll get to the architecture shortly.

Second, schema evolution is made easy. A lot of the time you want to make changes to the structure of your data: add a column, remove a column, or rename something. You can't do that safely on a plain data lake, but open table formats handle it very easily, and you can always go back to an older version of your data and interact with it.

Third, they have an efficient metadata management system. It's not just about storing your data; it's also about storing information about your data, which is what we call metadata: how many columns it has, what their data types are, how many rows there are, the max and min values of each column, and so on. Managing your metadata efficiently can drastically speed up the queries that users run, and this functionality is provided by the open table format.

They also provide time travel. If you make a mistake and want to go back and understand what the state of the data was yesterday, or a week ago at a particular time, you can do that. And the "open" in open table format means it is not proprietary: you can connect multiple engines such as Apache Spark, Flink, Presto, Hive, Trino, and more, so you can easily interact with all of your systems and standardize your data.

These are some of the fundamental reasons open table formats are widely adopted: schema evolution, time travel, ACID guarantees, and connectivity to any system. They make our lives much easier by taking the good parts of both databases and data lakes and combining them.

Now let's try to understand the different frameworks that give us all of these features. New things keep coming to market, and there are multiple open table formats available, but we'll focus on Apache Hudi, Databricks Delta Lake, and Apache Iceberg. I'll go into a little more detail on Apache Iceberg, and I may make a detailed video on its architecture if you're interested, but for now we'll just get the overview.

Let's start with Apache Iceberg. It was originally developed by Netflix, one of the largest streaming media companies. They had data coming from many different sources: user activity logs, streaming analytics, real-time operational metrics. You need a good system to easily manage all of that data, and Netflix needed one with high concurrency, manageable schemas, and easy time travel; that is why they built Iceberg. One of Iceberg's core features is its schema evolution capability: you can easily change your schema at any point in time. Another great feature is hidden partitioning: traditionally we partitioned our data ourselves and had to know where we wanted to keep it, but Iceberg handles all of this under the hood, so you don't really have to worry about partitioning your data; it stores everything efficiently in the background for you. Iceberg also supports time travel, so you can go back to a particular date and time. The ecosystem keeps growing, and you can always keep track of it by visiting the documentation.

So let's try to understand the architecture of Apache Iceberg. I'll explain it in a simple way, and we can go into detail in another video; for now, just get the overview. First, at the bottom, we have the data files. These are your actual data files stored in object storage like S3, and they can be Parquet, CSV, or ORC files. On top of them sit multiple layers of metadata that keep track of everything we do. Any time you make a change, such as an insert or a delete, Iceberg creates a snapshot, a new version of the table. The data files belonging to a snapshot are tracked by manifest files, which record how many rows you have and statistics such as the maximum and minimum values of each column, used for indexing. Every time something changes and a snapshot is created, more manifest files may be created (manifest file 1, 2, 3, and so on), and to track the manifest files we have the manifest list.
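The layering just described, snapshot to manifest list to manifest to data files with column stats, can be modeled with plain dictionaries. This is a toy illustration of the idea only, not the real Iceberg spec; all file names, stats, and the `scan` helper are invented:

```python
# Each manifest tracks data files plus per-file column statistics.
manifests = {
    "m1": [{"file": "a.parquet", "min_id": 1,   "max_id": 100}],
    "m2": [{"file": "b.parquet", "min_id": 101, "max_id": 200}],
}

# Each snapshot (a table version) points at its list of manifests.
snapshots = {
    1: ["m1"],        # snapshot 1: table before the append
    2: ["m1", "m2"],  # snapshot 2: after appending b.parquet
}

def scan(snapshot_id, id_value):
    """Plan a query: walk the chosen snapshot's manifests and use the
    min/max stats to skip data files that cannot contain id_value."""
    files = []
    for m in snapshots[snapshot_id]:
        for entry in manifests[m]:
            if entry["min_id"] <= id_value <= entry["max_id"]:
                files.append(entry["file"])
    return files

print(scan(2, 150))  # ['b.parquet'] -- a.parquet is pruned via its stats
print(scan(1, 150))  # [] -- time travel: snapshot 1 predates b.parquet
```

Two of the features from earlier fall out of this structure for free: file pruning comes from the stats in the manifests, and time travel is just reading an older snapshot's manifest list instead of the latest one.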

Segment 3 (10:00 - 14:00)

The manifest list keeps track of every version of the manifest files: it records information like "snapshot C is stored under manifest file 2, snapshot B under manifest file 1," and so on. To keep track of the manifest lists, we also have the metadata file. Each time the table changes, say you add or remove something, Iceberg writes a new JSON metadata file that captures information about the snapshots, schema, partitioning, and so on. So there are several layers of metadata, and all they do is keep track of everything happening at the data level. You don't have to memorize them right now; just understand that they record everything happening around the data. Then, on top, we have the catalog, which keeps track of the current state of the table. Whenever a user runs a query, such as a SELECT to retrieve some data, the catalog is consulted for the latest version of the metadata, the latest snapshot is fetched, and the results are returned to the user. Again, these are just the different layers; you only need the overview for now.

We also have Delta Lake, developed by Databricks. Delta Lake is arguably the most mature platform of the three. It is built on top of Apache Parquet and provides ACID transactions, along with a lot of performance enhancements. It uses a transaction log to keep track of all changes, ensuring the data stays consistent even in complex workflows. One of its most important aspects is its close integration with Databricks: if you're working on Databricks, you get Delta Lake support directly, so you don't have to integrate any external system to get a first-class open table format.

Lastly, we have Apache Hudi, which stands for Hadoop Upserts Deletes and Incrementals. As the name suggests, Hudi was optimized for real-time data ingestion and incremental data processing. Its key features include upserts: you can update-or-insert your data as required, so you don't get duplicates. The upsert feature makes CDC (change data capture) much easier; instead of separately updating and deleting your data, you can directly upsert it as required, which makes real-time analytics much simpler. Hudi also supports time travel and data compaction, so as your data size grows you can compact your data into fewer, larger files and avoid storage issues.
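The upsert behavior described for Hudi can be sketched as a key-based merge. This is a toy illustration only; the `upsert` helper is invented, and real Hudi performs this at the file-group level using an index to locate which files hold each key:

```python
def upsert(table, records, key="id"):
    """Hudi-style upsert sketch: merge incoming records into the table
    by key. Existing rows are updated in place, new rows are inserted,
    and no duplicate keys are ever produced."""
    index = {row[key]: row for row in table}  # key -> current row
    for rec in records:
        # Merge onto the existing row if the key is known, else insert.
        index[rec[key]] = {**index.get(rec[key], {}), **rec}
    return list(index.values())

t = [{"id": 1, "city": "NY"}, {"id": 2, "city": "LA"}]
t = upsert(t, [{"id": 2, "city": "SF"}, {"id": 3, "city": "SEA"}])
print(sorted(r["id"] for r in t))  # [1, 2, 3]
```

This is why upserts simplify CDC: a stream of change records can be applied directly, without a separate delete-then-insert dance for every updated row.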
Now, this was just an overview; we looked at the best of each of these frameworks. But the real question is: are open table formats really the next big thing? The short answer is yes, because companies need flexibility, they want to scale, and they want an efficient data management system, and open table formats give them that. We want a data lake that behaves like a transactional system without sacrificing the cost efficiency and scalability of cloud storage. In short, open table formats bring the best of the data warehouse and the data lake, offering the best of both worlds, and as data grows exponentially, the importance of open table formats will only increase.

Here are some of the advantages of open table formats: a single source of truth, so you store your data once in the open table format and anyone in the organization can easily use it; transactions that guarantee the integrity and consistency of your data; schema enforcement, so you avoid garbage data (a lot of the time corrupted data gets inserted, and now you can catch it); performance near data warehouse level, so you get both flexibility and good performance; and time travel, so you can go back in history and inspect previous data.

So this was a quick overview of open table formats and why they might be the next big thing. If you want me to create a detailed video on a specific table format like Iceberg or Delta Lake, do let me know. I also teach Delta Lake in my Apache Spark course with Databricks; if you're interested, you can check the link in the description. If you have any video suggestions, do let me know, like the video if you enjoyed it and learned something new, subscribe to the channel if you're new here, thank you for watching, and I'll see you in the next video.