When I first started in the data world, no one around me used the term data pipeline.
I heard terms like integrations, automations and ETL.
In fact, I am not even sure when I first came across the term. But if you’re a data engineer in this modern era, then much of your time is spent building, maintaining, and keeping data pipelines running smoothly.
Even with AI, you’re probably still finding yourself opening up 3,000-line queries and the occasional custom data pipeline system.
If you enjoyed this video, check out some of my other top videos.
Common Data Pipeline Patterns You’ll See in the Real World - Types Of Data Pipelines You'll Build
https://youtu.be/htAipJ6yYFs
What Is BigQuery - Breaking Down What BigQuery Is And Diving Into A Hands On Walkthrough
https://youtu.be/pud-vuNE15g
If you're looking for help ingesting your data in batch or real time, then you need to check out Estuary - https://bit.ly/4eQC3oQ
If you'd like to read up on my updates about the data field, then you can sign up for our newsletter here.
https://seattledataguy.substack.com/
Or check out my blog
https://www.theseattledataguy.com/
And if you want to support the channel, then you can become a paid member of my newsletter
https://seattledataguy.substack.com/subscribe
About me:
I have spent my career focused on all forms of data. I have focused on developing algorithms to detect fraud, reduce patient readmission and redesign insurance provider policy to help reduce the overall cost of healthcare. I have also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. I privately consult on data science and engineering problems both solo as well as with a company called Acheron Analytics. I have experience both working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.
*I do participate in affiliate programs, if a link has an "*" by it, then I may receive a small portion of the proceeds at no extra cost to you.
Table of contents (3 segments)
Segment 1 (00:00 - 05:00)
Why in the world do data pipelines still even exist? Like, we are in the world of AI, and yet so many companies I still work with are building Airflow and dbt and other types of tooling into their data stack so that they can essentially extract data and push it somewhere else. And I know some people are going to say Airflow is an orchestration tool, but it often gets used somehow in the data pipeline, so that's why I often reference it. A while back there was a post from Zach Wilson where he kind of called out the fact that if you're a data engineer who still says, you know, you're pushing data from point A to point B and that's kind of your role, that job will probably stop existing in the next few years, especially because of AI. And, you know, there's a lot of reason for that. If I go and ask Claude or ChatGPT, or whatever your LLM of choice might be, to build even some more complex transforms or data ingestion components, guess what? Some of them can do it pretty well. Maybe they'll give you some problems, and if you can't tell the difference, you know, that's why senior engineers need to now check off code at Amazon, but in the next few years you could see some components of that going away. But I think in order to understand where the role of data engineer might go in the future, it's also important to understand why data pipelines exist, what they do, and why we are still likely going to need them in the future. So let's start this video by talking about the fact that when I first started in the data world, guess what? Very few people called things data pipelines. Maybe occasionally ETL, but honestly it was a lot about integrations and moving data from point A to point B, sure, but really the term data pipeline I actually didn't hear for the first maybe three to four years of my career, as I was building what people consider data pipelines.
These often look like various things. I've already referenced things like Airflow; you know, you've got some automated SQL scripts that maybe move data. You've got your ingestion component. You might break it down into things like ETL, right? I'm obviously someone that always likes to talk about Estuary and using that to ingest data, so if you need a good data ingestion tool, go ahead and check them out. And actually, if you watch one of my recent videos about common patterns, I even referenced how in many ways Excel can be used in a similar fashion to many data pipelines, right? You have to extract data, put it into some sort of main Excel sheet, do some transforms on it, and then serve it into something like a Tableau dashboard. So you've kind of built a data pipeline. So what is the real reason data pipelines exist, and what do they do? Why can't we just do some basic COPY INTO from some sort of raw stage, be it Snowflake or whatever tool you're using, into your raw table and then call it a day, right? Like, why isn't that where things end? Here are several things that I think data pipelines do that are very important. One, they provide a level of timeliness, right? Like, even when you can't do something, they can do something, and we'll go into each of these a little bit here in a second, but that's the high level. Accuracy, right? They give you some level of checks that you put in there to make sure that the data is correct. Consistency, right? They provide this ability to know that when this data pipeline runs, it will do the exact same thing it did the last time and make sure you get the data in the right shape and you store the right data. Actually, this is kind of interesting. This has nothing to do with consistency, but I just had this problem that speaks to the things that data engineers do and why we're valuable.
I just had the problem where I was doing my taxes for 2025 and my accountant called out that I had sold some gold ETF, and they're like, oh, you made money. Do you know what you bought it at? I'm like, what do you mean, do I know what I bought it at? Doesn't Charles Schwab know? And they're like, well, you must have bought this on the prior platform. I had bought it on TD Ameritrade. Long story short, that became Charles Schwab, and they had lost that information. They just lost it. They're like, we didn't even track the original price you bought this at. It's wild to me. I'm like, how do you... I don't know what that means, right? So during that transfer, right? And this happens all the time. People lose information as they switch from one system to another. This is something that we as data engineers often include in our systems, right? We try to capture change over time. So things like accuracy, and making sure data stays accurate over time as well, are very important. Recoverability, meaning we can rerun that pipeline very easily, and scalability, right? Those are all key aspects of data pipelines. We can build a thousand of these. We don't have to manually manage these. I think some other key things to think about as you're building data pipelines are things like integration. So how does that data actually mesh with other data? That's usually part of your data model, but it is implemented via the data pipeline, right? That's where you often put a lot of this logic. Availability, I think that's a big thing. Data pipelines make data more available to a broader audience, not just data engineers, but analysts, you know, AIs, LLMs, MCP services. Everyone wants to do that, right? We've got all these tools that are integrating and wanting to be integrated with. And of course also outcomes. I think there is this goal of data pipelines to not just drive information, but also drive outcomes. And let's kind of talk about each of these individually. Like, why is it important, right?
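To make the "capture change over time" point concrete, here is a minimal Python sketch of keeping versioned history instead of overwriting a record during a platform migration. Everything in it is an assumption for illustration: the `PositionVersion` fields, the `apply_change` and `value_as_of` helpers, and the platform names are hypothetical, and a real system would do this in warehouse tables rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PositionVersion:
    """One version of a holding; all field names are hypothetical."""
    account_id: str
    symbol: str
    platform: str
    cost_basis: Optional[float]
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[PositionVersion], account_id: str, symbol: str,
                 platform: str, cost_basis: Optional[float], as_of: date) -> None:
    """Close the current version and append a new one instead of overwriting."""
    for row in history:
        if (row.account_id == account_id and row.symbol == symbol
                and row.valid_to is None):
            row.valid_to = as_of
    history.append(PositionVersion(account_id, symbol, platform, cost_basis, as_of))


def value_as_of(history: List[PositionVersion], account_id: str, symbol: str,
                as_of: date) -> Optional[PositionVersion]:
    """Return the version that was valid on the given date, if any."""
    for row in history:
        if (row.account_id == account_id and row.symbol == symbol
                and row.valid_from <= as_of
                and (row.valid_to is None or as_of < row.valid_to)):
            return row
    return None


history: List[PositionVersion] = []
# Original purchase recorded on the old platform.
apply_change(history, "acct-1", "GOLD_ETF", "old_platform", 180.0, date(2021, 3, 1))
# Migration: the new platform did not carry the cost basis over.
apply_change(history, "acct-1", "GOLD_ETF", "new_platform", None, date(2023, 9, 1))

# The question about the past can still be answered from the old version.
before_migration = value_as_of(history, "acct-1", "GOLD_ETF", date(2022, 1, 1))
```

Because the old version is closed rather than replaced, the purchase price survives even though the current record no longer has it.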
I've already kind of referenced like, hey, it's important to like capture data over time because if you have a question about the past, like what price did you buy this ETF or
Segment 2 (05:00 - 10:00)
stock at some point in the past, you want to know that information, right? Cuz that's going to be a problem otherwise cuz it's not like you store it, right? You're not going to be like, oh, I know exactly where I did that. I mean you're probably selling in and out of things. You don't actually remember what you bought it at. That's what the system is for. Another important aspect is integration. If you remember one of the pipelines I talked about the common pipeline patterns, integration was one of the key things I talked about cuz you often have to amalgamate multiple sources into one. You have to build a funnel from five different sources. How are you going to do that, right? If you just put a bunch of raw data into tables, that's not useful. That kind of ties to that availability as well where it's like you're trying to build data that a broader audience of people can use. And sure, maybe you can ask your LLM to just figure it out. Be like, hey, here's six raw tables, figure out the best way to integrate it. That can maybe work at a small scale, but as you get more complex, I, you know, I'm working with businesses that might have 10, 12 different departments that all do different things. Like they literally have like, well, that part of the company sells, you know, product A. This part of the company sells product B, which has nothing to do with product A. And we want to tell like something about our customers across these systems and they use two different CRMs, two different ERPs, two different, you know, everything. Now you're starting to build and build on this complexity and you have to kind of try to simplify it down into a single model or as much as you can, a single model that can then be fed into other systems. So that's where integration comes in. You actually think about how do we make, you know, maybe customer data into one customer piece of information. 
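As a rough illustration of the integration step described above, here is a small Python sketch that maps two differently shaped CRM exports into one customer record. The source names, the column names in `FIELD_MAPS`, and the choice to match on a normalized email address are all assumptions; matching customers across real systems usually needs more careful rules than this.

```python
from typing import Dict, List

# Hypothetical column names for two CRMs that describe the same kind of customer.
FIELD_MAPS = {
    "crm_a": {"id": "Id", "email": "Email", "name": "FullName"},
    "crm_b": {"id": "customer_id", "email": "email_address", "name": "customer_name"},
}


def to_common_shape(source: str, row: dict) -> dict:
    """Translate a source-specific row into one shared set of field names."""
    mapping = FIELD_MAPS[source]
    return {
        "source": source,
        "source_id": str(row[mapping["id"]]),
        "email": row[mapping["email"]].strip().lower(),
        "name": (row.get(mapping["name"]) or "").strip(),
    }


def unify_customers(rows_by_source: Dict[str, List[dict]]) -> Dict[str, dict]:
    """Build one record per customer, keyed by normalized email (an assumption)."""
    unified: Dict[str, dict] = {}
    for source, rows in rows_by_source.items():
        for raw in rows:
            rec = to_common_shape(source, raw)
            customer = unified.setdefault(
                rec["email"], {"email": rec["email"], "name": "", "source_ids": {}}
            )
            # Keep a pointer back to every system the customer appears in.
            customer["source_ids"][source] = rec["source_id"]
            # The first non-empty name wins.
            if not customer["name"]:
                customer["name"] = rec["name"]
    return unified


crm_a = [{"Id": 1, "Email": " Ana@Example.com ", "FullName": "Ana Lopez"}]
crm_b = [
    {"customer_id": "B-9", "email_address": "ana@example.com", "customer_name": ""},
    {"customer_id": "B-10", "email_address": "bo@example.com", "customer_name": "Bo Chen"},
]
customers = unify_customers({"crm_a": crm_a, "crm_b": crm_b})
```

Keeping the per-source IDs on the unified record means downstream users get one customer to work with while the pipeline can still trace each row back to the system it came from.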
Now that brings up availability and I also bring up usability, which is the more you kind of process data, your goal is to make it more usable. I give this example in this article where I kind of referenced like wood, right? You've got this initial raw data. It's just lumber, right? It's very hard for anyone to use. I can build a log cabin with it, I guess, but like in order for it to keep being refined and better used, right? You kind of have to keep doing something to it. From raw logs, you have to make, you know, 2x4s and various components of wood and that can now go into maybe a woodworking shop and that can then be refined into an actual product. And then from those finished products, they still have to go somewhere, right? Like the finished product, your dashboard, is not the end state, right? You're trying to take that finished product and then make it have some sort of outcome, right? And then you got all this marketing and all this stuff attached to it to make this product actually part of someone's living room, right? And so throughout that process, you can kind of think of different people touching it to make it useful. You've got the lumber. Someone's got to chop it down and bring it to the mill and then actually mill it and make it usable for someone who's an expert at being like, oh, I can make really good tables. That could be your analyst, right? I can make really good desks and tables and whatever it might be. And then that analyst gives it to the business and that business can then hopefully work with it to make it an outcome that is desirable. But that all involves making each step a more usable component for the next person who maybe doesn't have the skill set or the tools to do the milling or maybe they don't have the skills or tools to do the actual building of chairs, but they might have the skills to understand how to set up a room, right? And so you're just increasing usability and increasing who can actually interact with the data. 
And that's to me part of the goal of a pipeline: to take that data from this raw state that very few people can use, right? It could be in JSON, could be in XML, could be in so many different formats. It's not integrated. And now you're putting it into one central place, you integrate it, you make it accessible to a broad range of users, whether that's machines or people, and now they can work with it. Another why, again, is scalability, right? When you're just doing this Ctrl C, Ctrl V data pipeline from one Excel file to another, from one CSV to another, and that's your data pipeline, that's great when you have two workflows that do that. At Facebook we had like a thousand just on our team, right? You cannot scale a thousand of those. You know, that's going to be a very painful job. It's boring. Someone's going to make mistakes. There are a lot of things that can go wrong. And so that's why I put together this spectrum of data pipelines. You might have something that's very human-interactive, and from there it might be more semi-automated where, you know, you might have someone manually run certain parts of the script and kind of check it as it goes. And then you might build these full-blown data pipeline systems that can run thousands of data pipelines, that can scale, that rerun things easily. If they have an error, you don't have to go in there and manually fix it. It does it for you, right? And that's very important. You're trying to build a data pipeline system that can take whatever work is thrown at it. Now in this next one I want to talk about outcomes, which I think is really important, especially as we get into this world of AI, right? Where anyone can build a data pipeline. You know, this was something that became very apparent to me at Facebook, where a lot of our technology made building data pipelines significantly easier than probably what you'd be used to. Maybe nowadays with AI it's different, but especially at the time, right?
Like all I really had to do is do SQL, put it somewhere and it would, you know, do all the heavy lifting. There was very little that I had to do compared to what I'd been accustomed to at other jobs. So now I have to be very focused on the outcomes. Like am I even building this data pipeline? You know, we're kind of in
Segment 3 (10:00 - 13:00)
this world where it's going to be so easy to put out code that the why becomes more important, right? Because it's so easy to build more liabilities, to build things that will just burn credits and tokens and whatever and have zero value, that the whole idea is going to be like, well, what actually drives the business? And that's going to become your value, right? Like, I can see how this data could connect to the business, versus can I even get this data out of the system? And so I want to give a few examples that we'll put up here. So if you're trying to think about what can actually drive value, here are a few. One, reduce unnecessary discounting by analyzing win-loss data and discounts to show where deals close without price concessions. So you end up reducing losses that don't need to occur anyway. Another one, improve onboarding success by identifying which onboarding steps and early product behaviors correlate with long-term retention, right? Maybe you lose certain clients, but if you had just an easy document somewhere you could solve it, or at least if you know where the problem is, you could start testing out different solutions to figure out, hey, maybe we need some sort of widget here or some doc here to make sure that as our end users are working with the product, they find it more valuable. Reduce support costs by linking support tickets to product events to eliminate the root cause driving repeat issues. And then the last one I have here is increase retention through proactive customer success by alerting CS teams when usage drops or support volume spikes. So, try to think through, you know, what would make your product more useful. A lot of these were very digital examples, I think. There are other opportunities that you have, but the way you should be thinking and talking is in that way.
Not "I built a data pipeline via Airflow"; it's "that helps reduce churn, that helps increase income." You know, those are the thoughts that you need to start having. Timeliness. So again, sometimes you have pipelines that need to land at like 6:00 a.m. because you're in a specific time zone and there's a board meeting at 8:00 a.m. their time. You're not going to have time to be awake, or you don't want to. Like, who wants to be awake to make sure that data pipeline lands? I don't, you don't. So a big part of that is making sure you have these systems in place so that at the right time the data is in the right place. And then the last point I'm going to cover, because I've already covered accuracy and a few other of these points, is recoverability. We kind of talked about this with, you know, being able to rerun pipelines in the past. I've talked about this with backfilling, right? You want to build systems that are easy to rerun. If a failure occurs, if bad data enters the system, how will you rerun that data? And I think that's an important aspect of building data pipelines: you need to have a way that you can rerun all of your data. And so recoverability is an important aspect of data pipelines, and these are why we build them, right? And the whole point of this video is to really cover that why. Like, why do we build data pipelines? And hopefully you kind of understand, so that as you're building them and as you're wondering where your job is going in the next few years, you understand what you're actually doing, not just the technical part, but the business part. And then you can start making more connections, thinking about those outcomes, and figuring out, okay, now that I have those outcomes and I'm thinking about them, how do I really start building the right data pipelines, not just any data pipelines. Hopefully you found this video helpful. I'm again doing a whole series on data pipelines, so I have more planned.
The next one will probably be full refreshes versus incremental loads and the kinds of decisions you have to make. I also have some thoughts on the AI videos coming up, like what's going on. I'm literally working on systems that involve it, and it's been a lot of fun interacting more with that side of things, but it's also definitely made it harder to make videos. So hopefully you enjoyed this video and I will see you all in the next one. Thanks all. Goodbye.
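The recoverability point from the video (pipelines that are easy to rerun after a failure or bad data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `target` is an in-memory dict standing in for a date-partitioned table, and `load_partition`, `backfill`, and the row fields are hypothetical names, not any particular tool's API.

```python
from typing import Dict, List


def load_partition(target: Dict[str, List[dict]], run_date: str,
                   source_rows: List[dict]) -> int:
    """Rebuild everything stored for run_date from the source rows.

    Overwriting the whole partition makes the load idempotent: rerunning the
    same date after a failure or a data fix cannot create duplicates.
    """
    target[run_date] = [
        {"order_id": row["order_id"], "amount": round(float(row["amount"]), 2)}
        for row in source_rows
        if row["order_date"] == run_date
    ]
    return len(target[run_date])


def backfill(target: Dict[str, List[dict]], run_dates: List[str],
             source_rows: List[dict]) -> Dict[str, int]:
    """Rerun a range of dates and report how many rows landed per date."""
    return {d: load_partition(target, d, source_rows) for d in run_dates}


source = [
    {"order_id": 1, "order_date": "2025-01-01", "amount": "10.50"},
    {"order_id": 2, "order_date": "2025-01-02", "amount": "5"},
]
target: Dict[str, List[dict]] = {}
row_counts = backfill(target, ["2025-01-01", "2025-01-02"], source)

# Bad data gets corrected upstream, so only the affected date is rerun.
source[0]["amount"] = "12.00"
load_partition(target, "2025-01-01", source)
```

The design choice is that a rerun replaces a date's data rather than appending to it, which is what makes "just run it again" a safe answer when something goes wrong.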