# Data Engineering Roadmap For 202*

## Metadata

- **Channel:** StartDataEngineering
- **YouTube:** https://www.youtube.com/watch?v=AzKgOlmNRa0
- **Date:** 15.04.2026
- **Duration:** 5:46
- **Views:** 362
- **Source:** https://ekstraktznaniy.ru/video/52945

## Description

Blog: https://www.startdataengineering.com/post/data-engineering-roadmap/

Trying to upskill as a data engineer? You most likely have come across one of the many data engineering roadmaps that list a long set of tools. 

If you are:

* Wondering how to convince recruiters and non-technical hiring managers to interview you, when you don’t “know” a tool

* New to the career and overwhelmed by the proliferation of tools

* Worried that LLMs will take away all data jobs

This video is for you.

00:00 The problem with DE roadmaps
00:40 Fundamentals and best practices
02:47 Process to learn a new tool
03:11 Example: Apache Iceberg
05:31 Conclusion

## Transcript

### The problem with DE roadmaps [0:00]

Hello. In this video, I want to talk about how you can learn data engineering in the 2020s. The roadmaps online seem to be getting longer each year. There are tons of new tools, tons of new technologies. It can be very intimidating to try to get into data engineering. If you're overwhelmed by all these tools, if you're worried about getting into the industry when you don't have experience with a specific tool, keep watching. In this video, I want to talk about an approach you can use to learn any new tool quickly and how you can understand a tool's trade-offs and use cases.

### Fundamentals and best practices [0:40]

In order to do this, you first need to understand what the fundamentals and best practices are. Fundamental concepts are the building blocks of your data pipeline. These include data storage, data movement, distributed data processing, metadata, lineage, observability, scalability, and orchestration, as well as coding in languages like Python and SQL. The best practices represent patterns that enable easy-to-maintain pipelines: things like data modeling, data architecture, idempotent pipelines, lambda architecture, data quality checks, etc.

And then there are tools that enable you to implement fundamental concepts. Tools like Spark, cron, and Iceberg create an abstraction on top of the fundamental concepts, and you can use that abstraction. For example, Spark enables you to do large-scale distributed data processing using the DataFrame API, and you don't have to get into the nuances of how that data is processed in a distributed fashion.

Frameworks standardize best practices based on industry usage. Things like the medallion data flow and the dbt project structure are patterns that companies can just use without having to think about how to structure their projects.

And then there are platforms, which are software as a service. They typically provide managed data infrastructure for open source tools, or sometimes they are closed source, like Snowflake or BigQuery. There is also a mix of both. For example, Databricks hosts Spark for you, but it also has additional features that are closed source.

With this knowledge of fundamentals and best practices, the first two, you can pretty much pick up any tool. Keep an eye out for any improvements in the fundamental concepts area and the best practices area, because any changes or improvements to these will usually be an industry-wide shift. For example, in dbt you can see the entire lineage, and that's one of the reasons why dbt is very popular.
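As an illustration of one best practice named in this section, here is a minimal sketch of an idempotent load step. It is not from the video: the table `daily_sales`, the function `load_partition`, and the sample rows are all hypothetical, and the delete-then-insert pattern shown is only one common way to make a rerun safe. It uses Python's standard-library `sqlite3` module so it runs without any setup.

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Load one day's rows so that a rerun leaves the same final state.

    Hypothetical helper: deletes the partition for run_date, then inserts
    the fresh rows, all inside one transaction.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (run_date, product, amount) VALUES (?, ?, ?)",
            [(run_date, product, amount) for product, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (run_date TEXT, product TEXT, amount REAL)")

rows = [("widget", 10.0), ("gadget", 5.5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # simulated rerun of the same day

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(row_count)  # 2, not 4: the rerun did not duplicate rows
```

The point of the sketch is the fundamental, not the tool: the same delete-and-reload idea carries over to whichever warehouse or table format you end up using.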

### Process to learn a new tool [2:47]

When you want to learn a tool, there are a few steps you can take to understand it in depth without spending a lot of time. The first is understanding which fundamentals and best practices it enables and what trade-offs it comes with. Then comes reading the documentation, trying it out, and looking for the community built around the tool. Let's look at an example.

### Example: Apache Iceberg [3:11]

Let's assume you want to evaluate Apache Iceberg. Let's first identify the fundamentals and best practices it enables and identify its trade-offs. It has schema evolution, hidden partitioning, layout evolution, and time travel. These all seem like data storage enhancements, adding features to the data store, and it also comes with reliability and performance. It also improves data transformation. And then there are integrations with other systems.

This is all great, but there's always a trade-off. Nothing comes for free. If you see upgrades to the data store, there needs to be a system that keeps track of all of these, so we can reasonably guess that we have to maintain some sort of metadata database to manage this. Similarly, there will also be maintenance. Let's see what that looks like. If you go to Concepts, Tables, Maintenance, you can see all the maintenance requirements here. You need to maintain the metadata, and you probably need to clean up old data, etc.

As you read through any tool's documentation, you will quickly see that every tool has its trade-offs. For example, the dbt CLI, while it's great, limits you to SQL. While there is added support for Python, it's not first-class support. SQL is the primary means of transformation in dbt.

Now you can guess what the pros and cons of a tool like Apache Iceberg are. The next step would be to practice it: try out the tutorials on the website, and then try to figure out if there is a plugin ecosystem around Iceberg, which there is, because Iceberg is supported in Python, Go, and other languages as well.

One thing that is difficult to learn just from online research is the nuances of using a tool. For example, if you think about Delta Live Tables from Databricks, you might not know the nuances that come with using it in production. In such cases, I have found that researching cons on Reddit or using LLMs has been very helpful. For example, this shows opinions on Delta Live Tables, and there are some really helpful comments here that go over some cons this person has seen in production.
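To make the Iceberg trade-off from earlier in this section more concrete, here is a toy sketch of why features like time travel imply metadata to maintain and old data to clean up. It is an assumption-laden illustration, not Iceberg's actual metadata format or API: the class `ToyTable`, its methods, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model only; NOT Iceberg's real metadata layout or API."""

    snapshots: dict = field(default_factory=dict)  # snapshot id -> tuple of data files
    current_id: int = 0

    def commit(self, files):
        """Record a new snapshot listing the data files that make up the table."""
        self.current_id += 1
        self.snapshots[self.current_id] = tuple(files)
        return self.current_id

    def files_at(self, snapshot_id):
        """Time travel: look up which files made up the table at a past snapshot."""
        return self.snapshots[snapshot_id]

    def expire_older_than(self, keep_from_id):
        """Drop snapshots before keep_from_id; return files nothing references now."""
        expired = {s: f for s, f in self.snapshots.items() if s < keep_from_id}
        self.snapshots = {s: f for s, f in self.snapshots.items() if s >= keep_from_id}
        still_used = {name for files in self.snapshots.values() for name in files}
        expired_files = {name for files in expired.values() for name in files}
        return sorted(expired_files - still_used)


table = ToyTable()
s1 = table.commit(["a.parquet"])
s2 = table.commit(["a.parquet", "b.parquet"])
s3 = table.commit(["c.parquet"])  # e.g. a rewrite replaced the earlier files

old_view = table.files_at(s1)           # reading the table as of snapshot 1
orphaned = table.expire_older_than(s3)  # maintenance: files that are safe to delete
print(old_view, orphaned)
```

Every commit grows the snapshot metadata, and old files stay on disk until something expires them. That is the kind of cost you can guess at from the feature list and then confirm in the tool's maintenance documentation.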

### Conclusion [5:31]

Now you will have a good idea of what a tool is good at, what it is bad at, what the trade-offs are, and pros and cons and opinions as well. If you learned something, please like, share, and subscribe. I'll see you in the next one.
