Spark SQL Workshop: Advanced join & group by techniques

56:29

Spark SQL Workshop: Advanced join & group by techniques

StartDataEngineering 21.06.2025 1 141 просмотров 67 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

Code: https://github.com/josephmachado/advanced_spark_sql_for_data_engineers/tree/main Full Course: https://josephmachado.podia.com/advanced-spark-sql-workshop-for-data-engineers Feedback Link: https://form.typeform.com/to/f51flAI1 1. Date & Time June 21st, 2025 1:00 PM - 2:00 PM EST (10:00 AM - 11:00 AM PST) 2. What You Will Learn * How to use JOINs to validate data and identify underlying data issues * How to use advanced aggregation functions & check data quality with GROUP BY 3. Who This Workshop Is For 3.1. Prequisites: * SQL basics, especially JOIN & GROUP BY basics (see basics here) * Basic understanding of fact and dimension tables * GitHub codespaces or Docker compose (if running locally) 3.2. Perfect for: * People with some experience in SQL * People who work with SQL regularly 3.3. Not suitable for: * People who don't know SQL basics, especially JOIN & GROUP BY basics (see basics here) * People looking for topics other than advanced JOIN and GROUP BY techniques 4. How to Join * Format: YouTube live workshop with hands-on coding * Participation: You are expected to code along * Interaction: Live Q&A session included * Practice: Exercises provided

Оглавление (1 сегментов)

PM - 2:00 PM EST (10:00 AM - 11:00 AM PST)

joined with customer table multiple dimension here we are just using one dimension and group by certain kind of dimensional column so we use year and ID and name of customers and we present some numerical number that's like the standard uh analytical data um pattern. So that's why joints and group buys are crucial. While I wait for this to start up, I do see a question from Tommen. What's the advantage of using strictly SQL versus Pispark Spark data frames? Is it just purely preference-based? Um there are a couple of um concerns. SQL is more widely used because it's understood by more people. However, if you want a more strict approach to programming with things like testing uh encapsulation, etc., you would generally use spicepark. In my opinion, one of the biggest benefits is that when you use spice spark data frames or spark data or polars or pandas data frame, you get intellisense in your code when you type code. So, that's really helpful. You can also have uh type hints in Python specifically and it's easier to test. Um and it's easy to kind of separate out code so you're not looking at this like giant SQL at a time. Um so there are some benefits of using kind of a data frame based approach. However, SQL is also quite prevalent because it's I mean a lot of people understand SQL and it's also well established. It's been there for decades now. Um and most times it's a mix of both. It's not purely SQL or purely data frame. Typically people do like simpler ones in SQL convert it to a data frame and send that as an output of a Python function. So it depends on the team but usually it's a mix. Okay. So let's go ahead and open our Jupyter notebooks. So please click on ports tab here and you'll see this 888 and if you click on this kind of open in browser button next to it that will open a new um tab. Uh this is going to be our Jupyter notebook where we are going to run all our um exercises. So go ahead and open notebooks. Oh sorry go to notebooks and click on advanced join group by ipython notebook. Uh and not the solutions one that is for yeah solutions for you know when you have time to look at the solutions. Um all right so let's go ahead and set up the data. So basically what we want oh actually let's open it in Jupyter lab that's a little bit um better and if you click on this um icon you can have a table of content which is quite helpful here. So let's go ahead and set up data. These are basically some scripts that create TPCH data and generate um the tables that we're going to use. And this is uh installing an extension which uh will kind of use to display the images. So let's go ahead and run these. All right. So we saw what fact tables are, we saw what dimension tables are, and we went over the join types and the kind of like a standard join query here. Uh but let's kind of look at a pictorial representation of it, right? So what's happening here is you have a fact table called orders. You're joining it with customer table which is a dimension and you get some results. Um, so you match, you group by and you present some sort of result that might be used to power a BI dashboard or be sent to an Excel. That's like the standard uh data pipeline um operation. However, in addition to the standard use cases of joins and group by they can also be used to uh find issues with our data set. So for example, joins specifically can be used to find differences in data set. You might have always used joins to be a way to join data sets. But there are techniques you can use by which joins now actually help you find differences. So in this workshop we will cover that mostly because I assume you know the standard ones standard joints. So the first one is called an anti- join. Uh this is basically if you have a problem or logic where you want to get rows that are present in one table but that are not present in another table. How do you do it? Right? I want to take five minutes and give you guys five minutes to kind of try these problems out for anti- joint. The first one is let's assume you have an orders table and you have a line item table. Each line item should be associated with an order. But assume there is some issue with the injection and you have to figure out which are the orders that don't have any line item. How would you do that? Um we have a query here. I'll put like 5 minutes on the clock. Please go ahead and kind of write your query here. You should get uh these two order keys 02 and 04. And if you have any questions, let me know in the chat. So there is a question asking with anti- joints do we care about columns from the table as well? Um I'm not 100% sure what you're asking but if you're asking do we care about columns from the other table in our case line item table usually no because we are only uh looking for table or rows that are present in one table that are not in another table. Um so that's basically what it is. Anti-join will only give you um in this case the order uh rows from order table. It will kind of ignore everything from line item table. Does that answer your question? Just as a hint, there is a join called anti- join like literally anti- join just like left join inner join that is anti- join in spark. So you can use that to write this query. And don't worry if you can't solve it. We'll also have the solutions um with everything solved so that you can kind of do at your own pace. I'm just going to keep that open on the side just so you know there are solutions so you don't have to worry if you're not able to do it right now. Let me know if anyone completed this and uh you're able to get 02 and 04 auto keys as the results. Oh, the code spaces does take a while to set up. Um, that's fine. I mean, you don't have to um kind of write the code in here. I guess you can just imagine how the join query would look like. That works too. Okay. Um, times up. But basically the idea is literally this where we are saying hey you know you have orders give me everything in orders table that are not in line item and we match by order key. So what this will do is look at every row and orders and give you the rows and orders whose order keys are not in uh line item. This is similar to a notin clause but we want to use anti- join because this is a simple join. When the join is more complex an anti- join will be really helpful uh compared to a notin clause. All right the next type of join is called a as join. I have a question ter. So basically just a left join. Yeah, that is basically what it is do. So here I have an alternate approach as well. It's a left join uh with order keys null and again uh one of the benefits here is just a simplicity of usage and the main pattern here is just knowing that this type of join exists. Um the next one is called an as of join. Um so if you want to if you have two fact tables and you want to get the data from one table as of a specific time in another table. So this is very important. This is as of specific time um based on the closeness in time right. Usually when you need to get the late this is used when you are like matching two facts by time of some um sort like when did this happen versus what was the state of this other table at the time that something in table A happened. Most people call this point in time join as well. So let's look at a quick exercise here. Right? Let's assume you have a stock table and you have a price tracker table. And stock table is basically the stocks that you bought on the date you bought it, right? So there like symbol like Apple and listing date is you can assume that's when you bought it. And then you have a price tracker table where you can assume every day there is some sort of system that logs the pricing, right? Um, hold on one minute. There is a issue. Give me one minute, please. Uh, Is the screen visible now? Are you able to see the screen better, Ragdev? Is anyone else having issues with the screen? Um, clarity. Okay, great. Yeah, sorry about that. Not clear what was happening. Anyways, as I was saying, as of join is used to um yeah, okay, there we go. As of join is used to identify when a certain event happened typically. So in our case, let's say you have the stock table which represents your stock that you bought in a certain date and then you have a price tracker table where you have the price listed for every date. This join will allow you to kind of figure out what the price was when you bought it. So for example, right in this Apple case, this is sorry in this Apple um case, let's say you bought it on 2024 115 and you have the Apple price tracked for every day. You want to pick the first the price closest to your date. So we uh luckily have 2024 1110 being tracked. So that's what we want to join. We don't want to join on 120 because that's after our purchase date. A way to do this is using something called an asoff join. However, Spark does not have that natively. Things like duct DB do. Um and in this case, what you would typically do is a join like a standard join, but you will have something called a range join. So instead of doing a direct join you will identify all the listing before all the price tracker information before your listing date and then you will use window function. So you will say like hey sort by price date in descending order and give me the latest one. So a combination of joins picking all the price tracker before the date you bought your stock combined with row number function and tracking to one will give you um the most current uh the most as of time information. This is a bit complex. Some DBs again have uh as of join. So things like duck DB uh I think has it as well. So if you are ever faced with a scenario um take make sure that you know your DB does not have as of joint before recreating it in with window functions. I'm going to skip this exercise for now because we are running low on time. Um the fine the next one is joins represent um also check on foreign key relationship. So what does that mean? That means you let's say you have a warehouse system and you don't control the ingest times right and you have like multiple tables landing at random times at different times. If you do an inner join you will end up with uh losing data because a table A might have arrived in like 5 minutes whereas table B might arrive 6 hours late. meaning the rows that are um present across both will not be available if you do an inner join. So let's look at a simple scenario. Um I wanted you to think about these things and it's not necessarily a query thing. It's more about thinking through this problem, right? Let's say you have a warehouse and you have two tables order table and customer table and order table is ingested every 5 minutes whereas customer 6 hours. How do you ensure uh your joints between orders and customers don't lose data and is as expected? Um so if you have any questions or thoughts, let me know in the chat. It's more of a scenario type question. It's not a straightforward answer. Um so yeah, let me know. I'll put like five minutes on the clock. So again the question is if you have two tables arriving at two different frequency into your warehouse and you do a join on those you are going to lose data. How will you approach this problem like and how do you think about table that arrive with different frequencies? So Tommen says you need to ensure you have some sort of table load date column for each table and then join that along with any other relevant keys. Let's think through that. Right? So you're saying we have a table load date. Um so that's a common pattern. So we can have something like um ETL inest date. Right? And uh you can say like only give me rows that are available at that time. Hm. I'm not sure how that would work, Tom. And you can have an ETL ingest date or insert date, right? Could you specify that join that you would do? Because even if you have the ETL specified date, um you cannot you will still lose data uh if you do an inner join. Um oh the next question is sorry if I'm mispronouncing your names gun is this a sedd um let's assume that's a really good point if you have an sedd um which for people who are unfamiliar it's uh slowly changing dimension I'm assuming you're talking about too that that's a good point where you can specify uh the exact time period for when to join that will give few correct joints. However, the same issue remains, right? Orders will have more data than customers. So, if a customer was created like an hour ago, they will still not be in the SED table, right? And they might have made an order. But now, if you do the join, even with an SED table, you won't be able to join because you will it'll result in no join. Andy asks, "You will need special logic to handle orders without Yeah, that's correct, Andy. You will need special logic. " Um, I want to get to that. That is basically the answer like what this special logic is. And I'll get to that after the next question. Uh, Sha asks how to predict which column we have apply Windows function for any general. Okay. Um I do cover window functions in another boot camp but not on this one. Um window functions a little bit tricky. It needs a lot more time uh to explain what it is. Um so coming back to your point Andy you do you are right. Special uh yeah you're exactly right. Special logic does depend on the business requirements. So let's take a look at this. Right. This is basically the kind of mermaid diagram right? You have auto data stream that comes in 5 minutes. Customer data time comes in six hours depending on the join. If you do inner join, you will lose customer data. You will lose orders data as well. And that might be fine for your reporting use cases, especially with things like finance type stuff where you need more specific numbers. That might be fine. However, there might be reporting use cases where directionality is more important than completeness. In those cases something like a left join would be better. So if you see the top here you can see customer uh name is null. So it's like that that's fine. Some in some cases that's totally fine. Uh ideally it's like or typically think of it in one of three ways. You can do a left join with nulls being tagged as unknowns. Meaning your orders will have a customer ID that is still not present in the customer um dimension table. That's a possibility. You can just mark it as unknown. And this is like from uh Kimal style processing. The second option is to do an inner join where if you need more complete data, more specific data, you typically want to do inner join so that you only give complete data. Then there is the final pattern where you do an inner join or left join at the moment but then you reprocess the same time frame of data at the end of the day. So it's like multiple processing. Um that's what companies typically do when you want to have both kind of like a fast data and also complete data later in the day. So typically you have like a pipeline running every hour and at the end of the day you would just like process the entire day's data. It is common but it's also quite tricky because if I'm a user and I see this report or whatever every hour and then suddenly the next day numbers are significantly difference it causes lot of chaos. So that is a trade-off that you have to think of when you're using this multiple pipeline approach that modifies the same data set. Um so please be mindful of that. So Sep says you can use something called change data capture uh to store data in the staging area and apply logic for SQL from staging. That's a really good point. Um so for people who don't know change data capture basically captures every change in your upstream data. So like every um create read or sorry create update and delete in your customer table will be logged as individual rows in the CDC data set and we can use that to model like sedd2 or more recent uh give a view that gives you the most recent view of a customer data. So that is an option we can do. Uh but in most in some cases depending on the company you work for different teams control different parts of the pipeline. So there's like a typical kind of ingestion team that dumps data into our warehouse. Sometimes it might be that they haven't had that chance to do CDC. Um so you can use CDC if you have the ability to use it. That would be ideal which means like more complete data. Um but if you don't have CDC you have to do the sort of like uh tradeoffs to figure out what to do. Uh yes Sha we can use change data capture in any data warehouse. Um change data capture it's more of a pattern uh than a specific tool. So you can if you're ingesting data into big query you can do change data capture. Uh typically you want to have like a batching mini batching mechanism before you insert into a warehouse. So like dump wait for like a some sort of thousand rows or five minute timeline and then dump the data. That's just so you avoid uh creating too many small files but I won't go into that more than that for uh this part of the workshop. Hello from SQL and healthcare department. Oh the SQL and healthcare is kind of heavy on uh the privacy stuff. So I worked a little bit with those. So it's uh pretty intense with the privacy HIPPA compliance and stuff. So good for you Nikant. Any other questions. Okay. All right. Let's move on to the next one. So we saw how we can use joins to get data not in a table, get data as of a specific time. Um and then finally how joins kind of expose underlying differences in data and also excuse me so how to deal with those differences. Then finally, there are some kind of common caveats that we need to think about before we do joints. So the main the biggest issue I've seen is people use tables who have multiple different grains. So this is a issue with data modeling. If your data model is incorrect or you have multiple grains for the same table, um your joints are just going to blow up and it will not work. So, make sure your tables have a single grain or at least filter them so that they do have a single grain before joining. Excuse me. Uh, we just went over handling slow and fast data joints based on use case. It's a complex one. It might seem simple at first, especially the path where you have to re reprocess the data multiple times. That's a tricky one because if you send out data to a stakeholder and it changes later, there will be a lot of questions. there'll be confusion some cases it's pro it's illegal especially in finance cases um so yeah please be very careful with that then there's the issue of null key joints so let's say you're doing some join and in this example right on ID and some value but if the value is a nullable value meaning nulls are allowed it will not join null literally represents an absence of value it means from a data modeling perspective that there is no data. It doesn't mean anything. It just say it just means there is no data. So null typically cannot be matched with null. However, in Spark and some other data warehouses, you can do a null safe join with this kind of special format. I think it depends depending on your database. But in Spark SQL, it's this. In spark pispark, it's equal equals. There is a function for this but it should be a last resort because if you're joining on nulls that usually means your data model is not correct because nulls represent absence of data. You should have either like replaced it with something like unknown or things like that. I typically tend to stay away from null join because it causes a lot of confusion where later on if you need to debug your pipeline especially this combined with like reprocessing pipelines that we saw before where you are reprocessing the same time um data for the same time it'll cause a lot of confusion. So please be mindful of that. And then finally, uh the last point, it's not necessarily incorrect, but it causes issues with your database engine not being able to uh optimize your query plan. If you apply functions as part of your join criteria, what happens is any possible optimizations such as uh partition pruning, um figuring out clustering, etc. will be avoided. Right? So if you have like this is a very naive example lot of data warehouses handle this specifically spark and iceberg but if you have something like your function in your join criteria sometimes the database engine might get confused and not apply any optimizations that it would have otherwise applied. So always try to kind of apply the function before your join like a CTE or a subquery and then do the join. That way you're not kind of ignoring any optimization um patterns that your DB engine might be applying. So uh this is a simple example where we create two kind of naive CTE ID1 some fu value ID2 some null value similarly ID1 bar ID2 null. If we use just an equal to this will not give you any results. However, since we are using the null safe equality, we will get none, which uh might be okay for your use case. My typically I tend to stay away from it, but this is an option we have. Excuse me. Before we go on to the next section, which is group buys, I'll set like five minutes uh for any questions you have. Um all right there is one from Nelly I want to learn SQL for healthcare department and pharma department basic and um so I think that that's a interesting question ningal I think it's I almost think of it as two separate things right one is SQL the other one is health and pharma department in my opinion what's more important is that understanding the business processes because I know the health and pharma is like super complex a lot of human involvement. So a lot of cases for uh data errors. So I think in my opinion the first case would be kind of understanding what data you're working with that and how that data is generated. That is more important than kind of like the hard skills of SQL um as a first step. Um Ash asks what database do you suggest to store time series from IoT devices? Um what's your data scale actually? If it's a few hundred rows a minute, Postgress would be fine. But if it's like a few million rows an hour, then you might have to think about something different. Oh yeah. Million rows an hour. Okay. Yeah, Postgress will not work for a million rows an hour because if you have hour, that's 24 million in a day. That quickly goes into um trillion whatever it's over a trillion a day. Yeah, 8B a day. Yeah, that's probably not a good idea with postgress. If it's 8 TB a day, uh what type of analysis are you running on it? Uh would be a question. But a typical pattern that I've seen a lot of companies use for analyzing large amounts of um similar data which I'm assuming IoT well uh formatted data is typically click house. Um so or druid more popular one is click house. So those are things that are optimized for storing large amounts of time series data. So what they do is they do pre-process your data. That way when you're doing aggregates it's already ready to go. So you can almost think of it as like sort of like a materialization uh pattern that they use. But yeah, Click House is what I would recommend you look at. Cool. Is there any other question? You're welcome. much. Are there any other questions? All right, seems like none. Um, is anyone having issues with code spaces setup? Oh, Andy has a question about my paid course. Oh, yes. Uh, thanks Andy. I am releasing a advanced Spark SQL for data engineers. So, that's specifically aimed at people who want to uh one of three things. get a lot of familiarity with how to use window functions and like what are the nuances when to use them when not to use them how it impacts performance and then the second one is um what was it I forget let me open that the second one is spark optimizations so we'll learn like how distributed systems work and how specifically spark distributed system work and how you can optimize your jobs based on Spark UI query planner and how to spot like bottlenecks and within your Spark processing uh things like that. Um it's I would say anyone under like five years of experience or less years of experience but don't or more not really super familiar with like the spark optimization techniques like how to do partition when to do partition what is the difference between partition and bucketing what are some techniques how do you sort it why how is data stored um etc. So it's a little bit more on the technical side for sure. Um and also like covers like how to write item pot and pipelines. So one of the key things that I have noticed that my job and all my in my previous decade and almost a decade is like when you run pipelines multiple times which generally happens when you have to do like back fills and such there's always some issues. So one of the topics I'll be covering is in the paid workshop is how to write item potent pipelines. That way you know even if you're rerunning pipelines like the our case with this uh kind of reprocessing joint pipelines we shouldn't create duplicate data. We shouldn't create um partial data. So those three things um so okay TLDDR it's meant for people who are looking to up their um spark SQL game writing better window functions item potent pipelines spark optimizations so I yeah and uh you can take a look at the course link that I will post in a bit I hope that answers your question Andy I went on a long rant um Nelly asks When take SQL healthcare certificate I do not plan to do a specific healthcare course neton so sorry I I cannot get access to healthcare data so I'm not going to do that is asking about country I'm from job situation um upcoming speculations yeah I'm not going so I'm from I'm uh in the US in the New York area. The job situation is not going to like super rough right now. Uh there are not a lot of openings. Uh even the ones that are open are expecting quite a bit. Um so you do need to put in a lot of effort. Um and then your question also includes question about upcoming job market. Um well I have about 10 years of experience. So take that into account. I have been getting more recruiter reachouts recently in the past month and a half. Uh but it's nothing compared to what it was pre like you know during the co time. Um it's it's rough going. One thing I would say is go through recommendations if you have any or try to find people who would refer you cuz that is the most easiest way. Um I did like three interviews last year. I got all three through referrals. Um, so yeah, I think referrals is the way to go. If you're applying, there are like hundreds of people applying. It's hard to get through. You need to kind of circumvent that somehow. And referral is in my opinion the best way. And also be ready for your interview, right? Like um Python, SQL, some data processing, how to how partition works. I the I would be surprised at how many people don't understand partitions uh with like highly technical roles. Um I hope that helps. Sha, do you have any other questions with that? What does my dayto-day look like as an expert? Well, thank you actually for calling me an expert. But uh mine so it depends on the company. Uh I worked in like companies where we didn't have data platform and I was setting up the data platform. I worked in companies where data platform is like well established and I was building more like pipelines and additional tools. right now at the level I am at it's more about excuse me planning so and kind of making sure any work that is being done is actually helping um the stakeholder. So things like conducting user surveys making sure um you know for the next 3 months we have some sort of project that can deliver um measurable outcomes. So that is honestly most of my day. I do write code. I would say about like 50% coding, 50% kind of project management type stuff. All right, we are running low on time. So let's go on to the next part which is group buys. Typically you use group buys to kind of produce some report, create a metric for your um for your downstream consumers etc. But group eyes can also be used for validating assumptions about your data, right? And also creating some nested data structures and finally creating Excel like pivot table style reporting. So let's look at that. The first use case of uh group by besides reporting and standard use is checking distribution of data. So when I'm exploring an input data set, I always try to do this to see what if the data makes sense. So in this case, right, I want to set like five minutes for this. I want you all to kind of run this query and tell me what you think from the data, right? Uh the data I think my is saying, but basically the data will look something I don't have it here. Yeah, you'll just get your nation name and number of customers for that nation. look at that data and let me know what you think about that data set. Does it make sense? Is something incorrect with the data? Um, just does it make sense? It's like 5 minutes on the clock. And please let me know if you have any questions. I think something is happening with my Okay, that's why while we wait for the spaces to start. Let's continue here. Right. Did anyone get an answer for this group by? And what do you think? Um please let me know in the comments or chat. All right, something is off with the cold spaces stuff. But basically what you can see here is when you run this query, you will see nation name and China has will have the same kind of number of customers as the rest, which should not make any sense because of the number of people there. So those sort of uh kind of issues are what you typically want to look for. All right. So, let's go look at what this would look like, right? Okay. While that's happening, let's go on to the next one. Again, you don't have to do this right now. We can always do it when you kind of get home. Right. So, there is something something's happening. It's fine. Yeah, I think it restarted the whole kind of machine. Not sure why, but let's keep going. Um, the other use case you would use group by is to identify primary key constraints. So, if you have a table and they say like, hey, this key is the primary key, you would can use group by to validate that. And here's how you do it, right? So you can do a group by your primary key and do a count star with the having. Um so what this does is it'll group by your primary key or supposed primary key and figure out if there are any primary key row that are repeated. So that's what this having does. So that's a common use case of group by as well. Cool. Let's go to the next one. So we have all heard of like the standard group by functions which is min, max, average, sum. But in addition to that there's also like more type of group by functions that not a lot of people use. Those are like statistical function collection functions approximation function and convenience functions. So statistical is like if you want something like standard deviation or correlation you would typically um have those there. And then there's like collection function. So those are things like array aggregation um collect all the rows into a single set of rows. You can use this to create nested data uh types which are things like lists um dictionaries etc or call strrus. And then you can use approximation functions. So these are basically like count distincts but so much faster because just count distinct is really slow due to how it's being computed in a distributed system. approximation functions give you relatively right answer but it t it's so much faster because it uses like a um probabilistic data structure underneath it that makes them so much faster things like uh hyper log um etc and then there's the convenience aggregate function so let's take a look at a convenience function right so this is like a typical kind of aggregate function where we are saying hey if order priority is low count that um otherwise don't count it. That's what this kind of complex logic represents. However, if you look at your database um uh documentation that you might have something like count if. So this can be represented simply as this and that's just one u of the examples, right? Then there are more like is any are any of end rows um true? So you can use things like boolean and boolean arth specified here. So those are the four types of aggregate functions that I'd recommend you think about than just the standard ones. Um the idea here is instead of recreating complex logic in your code, always make sure that your um database documentation does not have uh check if they have any of these already available. Cool. And then the next one is uh I'm not going to cover this because we are running low on time but I essentially the idea is like you can combine multiple rows into one and that's basically the nested structure right here. You combine multiple line items into a list of strcts which you can think of it as like a list of typed dictionaries. In addition to that, in addition to standard group by you, you have also have these thing called group by variations. So three key ones are cube, rollup and grouping sets. If you are familiar with pivot tables in Excel, cube is similar to that. Basically if you say cube region category on the on here you can see like the exact group by unions that you would end up doing. Similar with rollup. Rollup is a little bit different because if you say region category it'll do like region category region and nothing true. So we'll see go over techniques on how to make your pipeline item potent and then finally optimization. What are the key concepts of optimization? How distributed system data processing works? How do you interact with distributed storage and how do you use like query planner and UI to optimize your data process processing code. Um and then finally storage pattern. So what's the difference between partitioning, bucketing, uh why sorting matters. So I think that if you know like the basics of SQL and or data frame as well and you want to kind of take your skills to the next level, that's what this is meant for. Um and this will all be done in Spark SQL. Um and you know, you'll get the code and everything. And yeah, it's next week. Um it says five four hours but don't worry there'll be like a 15-minute break between each section uh format it's led by me uh where there will be a meeting link that I'll send uh if you register and then it'll be recorded and available through podia runnable code and kind of community Q& A. Um yeah I hope that helps actually. Um the next question is AI in data engineering. H that's a good one. I have been playing around with it and have used it a little bit for work. The main area that I'm seeing it is first of all writing SQL for non-technical users. So when a nontechnical user has a question, you send it through AI, it'll generate a query and give it to them. They can run it. Um the next thing I have seen is also writing documentation. People are trying to do that, but it's kind of terrible in my opinion because all it does is literally just take the code and say this is what's being done. However, the true value of a documentation is giving context. Anyone can if I'm reading a documentation, I can I would rather read the code to understand the logic. The documentation is there to give the context as to why a certain thing was done a certain way. Um, LLMs don't do that yet. So, it's a little bit um of a let down. Where I do see it being significantly used and very powerful and I use this all the time is all with the unit testing. I tell it how to test it. What are the tests to run? It just generates a bunch of unit test code. That is where I've seen it work really well. Everywhere else in my opinion, and I have 10 years of experience, so take that in mind. It creates me more work than I would have done if I hadn't used it. Besides for testing where it's very helpful or like repetitive code scaffolding um but it doesn't do anything innovative in my opinion so far. Uh it might change in the future which I don't think so but yeah I hope that answers your question actually. Tom asks oh you are just curious. Uh yeah, the cross apply people don't generally use it as much. Um people I mean in real in in bigger companies as well, right? People try to keep it as simple as possible. You'll be surprised at how scrappy a lot of the big companies are. Um and how a lot of people don't fully know all of these things that we are going over as well. So yeah, I hope that helps. Um any other questions? We are little over time though. Uh yeah, this is mostly going to be on the US time zone. I don't have it specifically for Indian time zones. It's going to be like an afternoon. So I'm not sure if people from India can actually join. Um yeah, I do plan to kind of have it as like a um recording. That way that'll be like a separate recording because the in the uh online workshop I'll be um staying online after the workshop for a few hours to answer any questions as well. Any other questions? Okay. If there are no questions. Oh, yeah. Thank you for attending. I appreciate your time. You know, it's a weekend, so I appreciate it. Um, thanks, Domin. All right, I'm gonna end this if there are no more questions. Thank you, Sundep. Thanks everyone. Bye-bye. Um, sh I don't have anything specific for GCP. Um, but all of the content that I try to put out is should be kind of tool agnostic. So, Spark tool agnostic. I mean, you can use data bricks, you can use like I don't know like EMR or whatever you want. um SQL this is also tool agnostic um I know uh BigQuery has a lot of these there's a separate repository called BigQuery utilities that give you access to all this stuff and more um yeah and the and when I went over kind of this part where we go over item button pipelines query patterns these apply to every all data warehouses not just uh specific to BigQuery so Yep. Thanks, Josea Nildo. I hope I'm pronouncing your name right. All right. Thanks, Sha. All right. Thanks, everyone. Bye-bye.

Другие видео автора — StartDataEngineering

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник