# How to transform data using Data Flows

## Metadata

- **Channel:** BeardedDev
- **YouTube:** https://www.youtube.com/watch?v=HTfoSGe5EYY
- **Source:** https://ekstraktznaniy.ru/video/44476

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hello and welcome to my channel, I'm BeardedDev, and in this video I'm going to talk you through how to perform transformations within data flows. We have a simple file in Data Lake Storage that I've already set up a dataset for; it contains orders and customers, as well as the products ordered and the amount of spend. In this video we're going to look at how we can extract that customer information to build a DimCustomer table. We're not going to focus on performing slowly changing dimensions within the data flow, but that video is coming next week on the channel, so keep an eye out for it.

We also have a target table, DimCustomer. If we have a look at it, there's no data in there at the moment, but we can see the attributes: customer key, first name, last name, and then valid from, valid to, is valid, plus some created and modified dates, ready for when we perform that SCD processing in the future.

To get started with data flows, we just click on the ellipsis and click New Data Flow, which brings us to this window. The first thing we'll do is give the data flow a name; it defaults to "dataflow1", and I'm going to call this df_customers. One thing about working with Data Factory or Synapse pipelines is that you need your naming conventions in place before you start: as we go through this you'll see there are different transformation steps, and data flows, datasets, pipelines, and triggers each have to have a name. There are some good articles online on best practices for naming conventions.

Now that I've named the data flow, I'm going to close the Properties window and hide the Factory Resources pane so we've got a clearer screen. I have come across a few display issues with Data Factory; I find that Google Chrome tends to perform better than Microsoft Edge, and we'll probably run into some of those issues later. I'm also using a laptop with a 14-inch screen, so if you have a bigger screen it might be easier.

The first thing we're going to do is add our source of data, and again we'll give it a name: SourceCustomers. Each of these little boxes on screen is known as a stream, and as we move through the transformations we'll usually have an incoming stream and a name for the stream that receives the data. Within the source we specify our dataset, which is our data lake file, and once that's done we can add transformation steps by simply clicking the plus button.

Now, I know from working with this file that it has about one million rows but only a small number of unique customers, and the best practice when working with data flows, whether in Data Factory, Synapse pipelines, or any other ETL or ELT tool, is to minimize data as close to the source as possible. Because we're working with a CSV file, we also don't have data types defined within the file; that's something we're going to look at as well.

On screen you can see the different transformations available. I can join datasets, split them, or check whether data exists in other datasets. Then we have the schema modifier category, with standard transformations such as derived columns and aggregates, which we'll go through in this video, along with some others. There are also formatters, row modifiers, flowlets, and then the sink, which is where we'll actually write the data.

As I mentioned, the file has a lot of columns that we don't actually need, and as a best practice it's better to trim that data as close to the source as possible, so one of the options we could use is a select,
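The idea of trimming columns close to the source can be sketched in plain Python. This is only an illustration of the select step's effect; the sample rows and column names below are my own assumptions, not the actual file:

```python
import csv
from io import StringIO

# A hypothetical slice of the orders file; real column names may differ.
raw = """CustomerID,FirstName,LastName,OrderLineID,Spend
1,Ada,Lovelace,100,9.99
2,Alan,Turing,101,4.50
"""

# A select transformation keeps only the columns later steps need,
# minimizing data as close to the source as possible.
keep = ["CustomerID", "FirstName", "LastName"]
selected = [{k: row[k] for k in keep} for row in csv.DictReader(StringIO(raw))]

print(selected[0])  # → {'CustomerID': '1', 'FirstName': 'Ada', 'LastName': 'Lovelace'}
```

Note that every value is still a string at this point, because a delimited file carries no type information; that matters later when we map to the typed sink table.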

### Segment 2 (05:00 - 10:00) [5:00]

The select transformation just allows us to choose the columns from the source that we're going to work with. We can specify our columns in here; perhaps we're not working with the order details, so we can simply remove that data. It's quite straightforward to do.

In this case, though, we have duplicate customers. As I said, there are a million rows in the file because it's customer orders, and we want to return the distinct customers. Because of that, I can skip the select and go straight for the distinct customers. If I go back to the data flow, delete this step, and try to add a distinct transformation, we can see that distinct isn't actually available in the list; instead we need to use an aggregate. So I'm going to add my aggregate step and call it DistinctCustomers. I'll expand that, and in the group-by columns I'm going to select from the list the columns that are relevant to a customer: customer ID, first name (we can type in the box to filter if we're working with a large number of columns), last name, date of birth, home address line 1, city, and last of all postcode. Those are the customer columns we're going to work with.

The catch with using an aggregate instead of a distinct shows up when we validate. One problem is that we haven't added our sink destination yet, but also, because we're using an aggregate transformation, we need at least one aggregate column. What I tend to do in this situation is add an aggregate that doesn't really mean anything; we're not actually going to use it. I've got an order line ID here, so I'll just enter a simple expression, count of order line ID, and that's effectively just to resolve the error. If we click Validate again, we can see the error has gone. Like I say, we're not going to do anything with this column; it's just so the data flow validates. I'll collapse that down again.

What I do like about working with data flows is that we can add our sink in now and it will tell us the problems with the data flow: on Validate we can see the current problem is that there's no sink. So let's add a simple sink; data flows can be as simple as a few transformation steps, it doesn't really matter. We'll call it SinkCustomers, with DistinctCustomers as the incoming stream, and then add our DimCustomer dataset. In the Mapping tab I'm going to turn off auto-mapping so we can map the input and output columns ourselves, and I'll expand that. I know customer key is a surrogate key in this dimension table, so we're not going to map anything to it; it will be auto-generated. This is what I mean about looking at what the issues are: some columns are automatically mapped for us, but some aren't. A column like date of birth could match on name, but if we add it in we get a warning that there's actually a problem with the data types. So we add in date of birth, and source ID is actually where we're going to write the customer ID,
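The distinct-via-aggregate trick, grouping by the customer columns and adding a throwaway count just to satisfy validation, can be mirrored in a small Python sketch. The column names and sample rows are assumptions:

```python
# Orders with repeated customers, standing in for the million-row file.
orders = [
    {"CustomerID": 1, "FirstName": "Ada",  "OrderLineID": 100},
    {"CustomerID": 1, "FirstName": "Ada",  "OrderLineID": 101},
    {"CustomerID": 2, "FirstName": "Alan", "OrderLineID": 102},
]

# Group by the customer columns; the count of OrderLineID plays the
# role of the dummy aggregate the data flow needs, and is never used again.
groups = {}
for row in orders:
    key = (row["CustomerID"], row["FirstName"])
    groups[key] = groups.get(key, 0) + 1

distinct_customers = [
    {"CustomerID": cid, "FirstName": name} for (cid, name) in groups
]
print(len(distinct_customers))  # → 2, one row per unique customer
```

Grouping by every customer attribute with no meaningful aggregate is exactly why the result collapses to one row per unique customer.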

### Segment 3 (10:00 - 15:00) [10:00]

because that mapping is something we're going to use in the future for the SCD processing, and there are a few other columns in there that we haven't mapped yet. We do pass validation in terms of data types, but this wouldn't actually run, and this is what I mean about building the data flow backwards: by taking our dataset and passing it to the sink, we now know which further steps we need, and they're relatively straightforward to add.

We know we've got a problem with date of birth and customer ID in terms of data types. To resolve that, we go back to the data flow and add a step in between what we've just set up: a derived column transformation. A derived column can be used for a number of things; in this case we're going to use it to convert data types, but we could also use it to add new columns, which we'll do shortly, or simply to apply expressions to columns we already have.

Let's add the derived column transformation and call it ConvertDataTypes; I'll expand it so it shows on screen better. The first column is date of birth. We have the expression builder here, which is available on all the different transformations (and is something that doesn't display well on screen). What I want to do with date of birth is convert it to a date. There's a search box where we can see functions and input columns, so if we search for "date" we see all of the functions that relate to dates, and what I'm going to use is simply toDate. We click on that, it's added to the expression for us, and then we add in the date of birth column, so we're converting it right there.

We're going to do a similar step with customer ID. Again, let's open the expression builder; in this case we're still converting the data type, but we're after an integer, so we're going to convert the customer ID column to an integer. Save and Finish, then collapse. If we go back to the mapping and expand it, we can see the warning has now been removed. So that's one thing we can do with derived column transformations; I'd say it's probably quite a powerful transformation and one you'll use a lot if you're working with data flows.

The next thing we need to do is add some columns for our validity periods and our created dates. We need a current timestamp: the valid from, the created date, and the modified date should all be the same in this scenario. We also need a column for the is-valid flag and an end date for valid to. I could go back to the ConvertDataTypes step and rename it, but in fact, just to make it cleaner, we'll add another derived column step; that way we break the steps down into small units of work. So we'll have an AddColumns step, and again we'll expand it out. We're going to add a new column, and in here we can simply type in a name. Again we open the expression builder; at the moment we just want to record a timestamp, and there's a function here called currentTimestamp. That's going to represent our valid from,
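The two conversions in that ConvertDataTypes step behave roughly like the Python sketch below. This stands in for toDate and toInteger in the data flow expression language; the sample row and the `%Y-%m-%d` source date format are assumptions:

```python
from datetime import datetime

# One raw CSV row: every value arrives as a string because the
# delimited file has no data types defined.
row = {"CustomerID": "42", "DateOfBirth": "1985-03-17"}

converted = {
    # Equivalent of toInteger(CustomerID) in the expression language.
    "CustomerID": int(row["CustomerID"]),
    # Equivalent of toDate(DateOfBirth); the format is an assumed one.
    "DateOfBirth": datetime.strptime(row["DateOfBirth"], "%Y-%m-%d").date(),
}
print(converted["CustomerID"], converted["DateOfBirth"].year)  # → 42 1985
```

Once the columns carry proper types, the sink mapping's data-type warning disappears, which is exactly what the mapping tab showed after the conversion step was added.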

### Segment 4 (15:00 - 20:00) [15:00]

and the same timestamp will also serve as our created date and modified date. Next we're going to add another column for our end date, and again, instead of choosing an existing column, I can simply type a column name here and open the expression builder. In this case I want to build a date, so I'm going to build a timestamp much like the conversion we did before, using toTimestamp, and the validity period is going to end very far in the future, on the 31st; we can add in the time as well, and that converts the string to a timestamp for me.

The last thing we want is a validity column, which I'm just going to call Validity. Notice I'm not giving it the same name as the column in the sink; I prefer that so I understand exactly where the data is coming from and what we're mapping to. You'll see when working with data flows, especially when combining different datasets, that columns with the same name can be tricky; to uniquely identify them you usually prefix them with the dataset name, or rather the stream name, and then the at symbol. For the validity I'll open the expression builder again: it's actually a boolean value, a bit in SQL, so instead of just a one or a zero we can add true.

If we Save and Finish and go back to the mapping exercise and expand it, we can now add in our valid from, which is the start date we created, our valid to, which is the end date we created, and is valid, which is our Validity column, and as we do this we can see that the data types of the inputs and outputs actually match. Again we'll use the start date for the created date and the start date for the modified date. If we validate the data flow, we can see we've got no errors. So data flows are a very simple no-code solution that lets us pick up data, perform transformations, and write it to
a destination. They're very easy to get started with; if you have experience with Integration Services or other ETL tools, you'll find them quite familiar. There's a variety of transformations we can make, and a lot of built-in data sources.

Now what we need to do is add our data flow to a pipeline. We're going to create a new pipeline and give it the name pl_customers, then hide the Properties pane. Under Move and Transform we've got the Data Flow activity, which we can drag onto the canvas; I'll give the activity the name df_customers and then select our data flow, since any data flows we've created will appear in this list. We're going to publish our changes; at the moment we don't have DevOps integration set up within our Data Factory, so it's just publishing to the service. Then we'll simply trigger the pipeline, click on View Pipeline Run, watch it run for a short period, and click the sunglasses icon for the details. The first thing we can see is that the data flow status is currently Acquiring Compute; that's going to take a few minutes to start up a machine and run the data flow,
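The AddColumns step from earlier in this segment can be mirrored in Python as follows. The 9999-12-31 sentinel end date is my assumption for the "very far in the future" value (a common SCD convention), standing in for currentTimestamp(), toTimestamp() and true() in the expression language:

```python
from datetime import datetime

# currentTimestamp(): one timestamp shared by ValidFrom, CreatedDate
# and ModifiedDate so they all match within a single run.
start_date = datetime.now()

# toTimestamp(...): a far-future sentinel for ValidTo; the exact
# value 9999-12-31 23:59:59 is an assumed convention, not from the video.
end_date = datetime(9999, 12, 31, 23, 59, 59)

# true(): the boolean is-valid flag, rather than a 1/0 bit.
validity = True

dim_row = {
    "ValidFrom": start_date,
    "ValidTo": end_date,
    "IsValid": validity,
    "CreatedDate": start_date,
    "ModifiedDate": start_date,
}
print(dim_row["ValidTo"].year, dim_row["IsValid"])  # → 9999 True
```

Sharing one timestamp across valid from, created, and modified is what makes the later SCD comparison clean: a changed row gets a new valid-from that exactly matches the old row's closing modified date.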

### Segment 5 (20:00 - 21:00) [20:00]

and once that's done we'll take a look at the results. Okay, we can see it ran successfully. Like I say, the source was actually a million rows, but there are only 10 unique customers in that file. It executed successfully and processed those results in 21 seconds; it just took the cluster 2 minutes 28 seconds to start up. We'll do some other videos in the future on cluster startup times, and I've actually got a video this week on how you can optimize that.

If we have a look in the database now at our DimCustomer table, we can see our data written from the file with the correct data types. Like I say, keep an eye on source ID; that's what we're going to use for the SCD processing next week. We've got our valid from and valid to dates, we've got is valid, and we've got the date and time that we wrote the data. I always like to add in a created date, and the modified date is going to store a history of when we made changes for the SCD.

I really hope you've enjoyed this video. As you can see, data flows are a very simple no-code solution that allows us to pick up data, perform transformations, and write it to a destination. Let me know in the comments your thoughts on data flows, and whether you'd like to see any other videos on data engineering or data analysis. Thanks a lot for watching.
