# Apache Iceberg Explained in 10 Minutes – Everything You Need to Know!

## Metadata

- **Channel:** CodeWithYu
- **YouTube:** https://www.youtube.com/watch?v=qHCuDOCJRaQ
- **Date:** 29.09.2025
- **Duration:** 11:54
- **Views:** 3,304

## Description

In this video, you'll get a complete overview of Apache Iceberg — a powerful open table format for big data and analytics in the cloud. Whether you're a data engineer, analyst, or developer, this guide will walk you through everything you need to know about how Apache Iceberg works, why it matters, and how it's different from formats like Delta Lake and Hudi.

WATCH FULL END TO END VIDEO https://youtu.be/2okBfWEMaFE

Like this video? Support us: https://www.youtube.com/@CodeWithYu/join

Hashtags:
#ApacheIceberg #DataEngineering #BigData #CloudComputing #DataLake #Flink #Spark #DataPipeline #TechTutorial #LearnDataEngineering

## Contents

### [0:00](https://www.youtube.com/watch?v=qHCuDOCJRaQ) Segment 1 (00:00 - 05:00)

So Apache Iceberg is an open table format that was developed at Netflix around 2017. It is now open source and everybody contributes to it. But to understand why this open table format matters, especially in the modern data lake world, we need to understand how it came to be.

If you can recall, when we started out as data engineers, the first integrations were the good old ETL, which lets us extract from multiple sources, transform the data through some layers, and then load it to some destination. Then came ELT, which is extract, load, and transform, as the name suggests. Essentially, you have all the data stored in some data lakehouse, and then you do the transformation and extract what you want from the individual files inside it. The last bit is the new open table format, the OTF. It is not an integration paradigm; it is more a way of interacting with the data, especially if you are using ELT in these modern data lakehouses.

So, with ETL you have a central data warehouse, the centralized location where all the data is loaded, and you can have as many small data sources as you like feeding it. It could be a single data warehouse that you're trying to migrate or integrate into, or some provider is sending you data, or you are doing the extraction yourself. Whatever it is, everything gets loaded into this data warehouse, and of course you can have multiple integrations between these layers before the data eventually lands in the warehouse.
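
The difference between the two paradigms is mostly the order of the steps. Here is a minimal sketch of that ordering, assuming in-memory lists stand in for the warehouse and the lake; the `transform` rule and the sample rows are made up for illustration and do not come from the video.

```python
Row = dict[str, object]


def transform(rows: list[Row]) -> list[Row]:
    """Hypothetical cleaning step: drop rows without an amount, tidy the name."""
    return [
        {"name": str(r["name"]).strip().title(), "amount": r["amount"]}
        for r in rows
        if r.get("amount") is not None
    ]


def etl(source: list[Row], warehouse: list[Row]) -> None:
    """ETL: transform first, so only cleaned rows ever reach the destination."""
    warehouse.extend(transform(source))


def elt(source: list[Row], lake: list[Row]) -> list[Row]:
    """ELT: load the raw rows untouched, then transform from the stored copy."""
    lake.extend(source)
    return transform(lake)


source_rows: list[Row] = [
    {"name": " alice ", "amount": 10},
    {"name": "bob", "amount": None},
]

warehouse: list[Row] = []
etl(source_rows, warehouse)

lake: list[Row] = []
elt_view = elt(source_rows, lake)

print(warehouse)  # cleaned rows only
print(lake)       # raw rows, including the incomplete one
```

Both paths end with the same cleaned view, but only the ELT path still holds the raw rows afterwards.
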
So, of course, this is going to be done with something like SSIS, if you started with that, which works just fine, or something like Talend, or, as you go along, something like Apache Airflow for some of these transformations, which also works just fine. Then comes dbt, which helps you with both the ELT and the ETL layer. So essentially your data is integrated and loaded into the data warehouse, period, full stop. This enforces some structure, like the schema, and that gives you better confidence in the data: durability, atomicity, consistency, and so on. Essentially, the data is there.

Now we come to the era of the lakehouses, where you can even have data warehouses inside your data lakehouse, and that gets integrated with multiple data warehouses and multiple data files, essentially as much data as possible. It could be audio, it could be video, it could be files, it could be JSON; whatever the data format, it can just be loaded into the data lakehouse. This gets messy pretty fast, and your data lakehouse essentially just sits there. You can then do the extraction and the transformations you want for visualizations or whatever insights you want to derive from there.

Okay. So we migrated into the era of data lakehouses, and this is where something like dbt becomes prominent. You can now put data into external tables. If you're using something like Redshift, you put some files into Redshift external tables and then do the transformations from there. This works just fine because the storage is more like your S3. I'm using S3 as the generic term here, because you have different versions on Azure, AWS, Alibaba Cloud, and the rest of the cloud providers. All right.
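
To make the contrast concrete, here is a rough sketch of schema enforcement at load time versus a store that accepts anything. The `Warehouse` and `Lake` classes and the `ORDERS_SCHEMA` columns are hypothetical names chosen for illustration, not part of any real product.

```python
# Hypothetical schema for an "orders" table: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "total": float}


def conforms(record: dict, schema: dict) -> bool:
    """True when the record has exactly the schema's columns with matching types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[col], typ) for col, typ in schema.items()
    )


class Warehouse:
    """Schema enforced on write: a non-conforming record is rejected at load time."""

    def __init__(self, schema: dict):
        self.schema = schema
        self.rows: list[dict] = []

    def load(self, record: dict) -> None:
        if not conforms(record, self.schema):
            raise ValueError(f"record does not match schema: {record}")
        self.rows.append(record)


class Lake:
    """No enforcement: anything can be dropped in, so problems surface only on read."""

    def __init__(self):
        self.objects: list[object] = []

    def load(self, obj: object) -> None:
        self.objects.append(obj)


good = {"order_id": 1, "customer": "alice", "total": 9.5}
bad = {"order_id": "two", "customer": "bob"}  # wrong type, missing column

warehouse = Warehouse(ORDERS_SCHEMA)
lake = Lake()

warehouse.load(good)
lake.load(good)
lake.load(bad)  # accepted without complaint

try:
    warehouse.load(bad)
    rejected = False
except ValueError:
    rejected = True

print(rejected, len(warehouse.rows), len(lake.objects))
```
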
So the data sits there and you do the transformation, but this doesn't enforce a schema, because now you have to be the one to design how you want to present the data. It could go straight from your lakehouse to some visualization layer. It could go from the lakehouse to another folder inside the same lakehouse, or another table, or whatever it is. It just gets messy really fast, you know, and it's not as

### [5:00](https://www.youtube.com/watch?v=qHCuDOCJRaQ&t=300s) Segment 2 (05:00 - 10:00)

messy is just like the way you want to present the data, essentially. Now comes Apache Iceberg. Apache Iceberg, as we were saying, helps you work out how best to put the data inside the lakehouse and connect it for presentation purposes. Now, this OTF essentially has a barrier in this case for working with it. So, each of the files that you have in your S3 is going to be somewhere here. You can have, let's say, Parquet, or maybe Avro, or ORC files. It could be many Parquet, many Avro, many ORC files, maybe support for JSON and the like, but all of these are the raw files that you currently have.

Now you have a layer on top of this: the manifest file. A manifest file is connected to as many of the raw data files inside the lakehouse as possible. So think of, let's say, 2,000 files. In this case you could have a single manifest file that contains the paths to all of these files, whether Parquet, Avro, or ORC. I'm going to change the color to set it apart. So you have the manifest file. Now, the manifest file could be connected to just one of these, two of these, or a combination of all of them. Then you can have another manifest file here which connects to just this one and this one, and another one connected only to the first one. So: manifest file, manifest file. Okay, that's what the structure of the manifest files looks like. It's a little messy the way I drew it, but essentially you can have manifest files with different combinations of all of these raw data files.

Now, on top of this, you have the manifest list. Let me change the color to blue. So this is the manifest list.
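
The relationship between raw data files and manifest files can be sketched as a small data model. This is only an illustration of the structure described above, not Iceberg's actual file layout or API; the class names, bucket name, paths, and record counts are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str          # location of the raw file in object storage
    file_format: str   # "parquet", "avro" or "orc"
    record_count: int


@dataclass(frozen=True)
class ManifestFile:
    path: str
    data_files: tuple[DataFile, ...]  # paths and stats of the files it tracks

    def record_count(self) -> int:
        """Total rows across every data file this manifest points to."""
        return sum(f.record_count for f in self.data_files)


# Hypothetical bucket and file names.
a = DataFile("s3://example-bucket/orders/data/a.parquet", "parquet", 100)
b = DataFile("s3://example-bucket/orders/data/b.parquet", "parquet", 250)
c = DataFile("s3://example-bucket/orders/data/c.orc", "orc", 50)

# One manifest can cover several data files, another just one.
manifest_1 = ManifestFile("s3://example-bucket/orders/metadata/m1.avro", (a, b))
manifest_2 = ManifestFile("s3://example-bucket/orders/metadata/m2.avro", (c,))

for manifest in (manifest_1, manifest_2):
    print(manifest.path, [f.path for f in manifest.data_files])
```

Because each manifest already lists the paths it covers, a reader never has to list the bucket to find out which files belong to the table.
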
Now, this manifest list can be connected to multiple manifest files. Say your raw files number in the hundreds of thousands. Your manifest files will then be in, let's say, the hundreds or even thousands, and your manifest lists will be in the tens or probably hundreds. On top of that layer we continue, but for now just think of it this way: you have the raw data files sitting there in their multitude, you have somewhat fewer manifests, each containing an array of those files, and you have the manifest list connected to a single one of them or to all of them as you go along. So you can have manifest list one, manifest list two, manifest list three, as many as you need.

The concept is getting clearer now, because you understand that the raw files are connected to manifest files, and manifest files to manifest lists. Now, what ties all of them together is called the metadata file, and this metadata file is what really sets things apart. So above the manifest list you have the metadata file. I'll call it "meta F"; you could call it MFL, whatever you like. What the metadata file does is connect to a single snapshot of all of these manifest lists. So say you have 10 manifest files inside a single manifest list. If you remove a single file from the directory, delete it, add a new one, or whatever it is, you've changed the structure. What the metadata file does is make sure that each individual snapshot at a particular time is maintained. So you could have snapshot zero, snapshot one, snapshot two, as many snapshots as possible inside a single metadata file. This metadata file is like a single source of truth, where everything is connected: the manifest lists, the manifest files, and ultimately the raw files, as it cascades down through the system. Then the final layer in this case will be the catalog.
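
A minimal sketch of how snapshots preserve the state of the table at a point in time, assuming one manifest list per snapshot and a single in-memory object standing in for the metadata file. The class and method names (`commit`, `files_at`) are hypothetical and chosen only for this illustration; manifests are reduced to tuples of paths to keep it short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ManifestList:
    # Each inner tuple stands for one manifest file: the data file paths it tracks.
    manifests: tuple[tuple[str, ...], ...]

    def data_files(self) -> set[str]:
        return {path for manifest in self.manifests for path in manifest}


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList


@dataclass
class MetadataFile:
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, manifest_list: ManifestList) -> Snapshot:
        """Record a new snapshot; earlier snapshots are left untouched."""
        snapshot = Snapshot(len(self.snapshots), manifest_list)
        self.snapshots.append(snapshot)
        return snapshot

    def files_at(self, snapshot_id: int) -> set[str]:
        """Data files that made up the table as of the given snapshot."""
        return self.snapshots[snapshot_id].manifest_list.data_files()


m1 = ("data/a.parquet", "data/b.parquet")
m2 = ("data/c.orc",)

metadata = MetadataFile()
metadata.commit(ManifestList((m1,)))                       # snapshot 0: a and b
metadata.commit(ManifestList((m1, m2)))                    # snapshot 1: c added
metadata.commit(ManifestList((("data/b.parquet",), m2)))   # snapshot 2: a removed

for snap in metadata.snapshots:
    print(snap.snapshot_id, sorted(metadata.files_at(snap.snapshot_id)))
```

Removing `a.parquet` in snapshot 2 does not change what snapshot 0 or 1 report, which is the "what happened in the past" view the metadata file keeps.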

### [10:00](https://www.youtube.com/watch?v=qHCuDOCJRaQ&t=600s) Segment 3 (10:00 - 11:00)

Now, this catalog points to the latest metadata file. So the latest metadata file is where your catalog is pointing. Think of your table. In your normal RDBMS, your table is connected to some data, and everything is stored in tables. In this paradigm, the catalog is the table, in this case connected to the metadata file, and that will always be the latest metadata file. That metadata file is connected to the latest snapshots of all the files in the system. The snapshot is connected to the manifest list, the manifest list to the manifest files, and the manifest files to the raw data.

So you have the catalog pointing directly to the metadata file, which holds snapshot zero, snapshot one, and so on. Snapshot zero can cover just part of this: this could be just this one, this could be this, and this could be that. But inside of that you have essential control over what happened in the past, what is currently happening, and what happened far back in the past. So the present state of the data lake is maintained inside this metadata file, and you go down to the manifest list, then the manifest file, then the raw data file. So this is going to be your table, which is pointing to the latest metadata file. Essentially, this is what the architecture of Apache Iceberg looks like under the hood. So, I know this is kind of interesting, especially if this is your first time getting to know what Apache Iceberg
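
The full chain described in this segment (catalog, metadata file, snapshot, manifest list, manifest file, raw data) can be walked in a few lines. This is a toy model under stated assumptions: a plain dictionary stands in for object storage, and the `Catalog` class, the `data_files` helper, and every path are hypothetical, not Iceberg's real API or file naming.

```python
class Catalog:
    """Maps a table name to the path of its latest metadata file."""

    def __init__(self):
        self._current: dict[str, str] = {}

    def point_to(self, table: str, metadata_path: str) -> None:
        self._current[table] = metadata_path

    def current_metadata(self, table: str) -> str:
        return self._current[table]


# Hypothetical object store: path -> contents. Each layer holds paths to the layer below.
storage = {
    "metadata/v1.json": {"current_snapshot": 0, "snapshots": {0: "snap-0.avro"}},
    "metadata/v2.json": {
        "current_snapshot": 1,
        "snapshots": {0: "snap-0.avro", 1: "snap-1.avro"},
    },
    "snap-0.avro": ["m1.avro"],             # manifest list for snapshot 0
    "snap-1.avro": ["m1.avro", "m2.avro"],  # manifest list for snapshot 1
    "m1.avro": ["data/a.parquet", "data/b.parquet"],
    "m2.avro": ["data/c.orc"],
}


def data_files(catalog: Catalog, table: str, snapshot_id=None) -> list[str]:
    """Walk catalog -> metadata file -> snapshot -> manifest list -> manifests."""
    metadata = storage[catalog.current_metadata(table)]
    if snapshot_id is None:
        snapshot_id = metadata["current_snapshot"]
    manifest_list = storage[metadata["snapshots"][snapshot_id]]
    return [path for manifest in manifest_list for path in storage[manifest]]


catalog = Catalog()
catalog.point_to("orders", "metadata/v1.json")
first = data_files(catalog, "orders")

# A later write produces a new metadata file and the catalog moves to it.
catalog.point_to("orders", "metadata/v2.json")
latest = data_files(catalog, "orders")
older = data_files(catalog, "orders", snapshot_id=0)

print(first, latest, older, sep="\n")
```

Readers only ever ask the catalog for one pointer; everything else, including older snapshots, is reachable from the metadata file it names.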

---
*Source: https://ekstraktznaniy.ru/video/52952*