Apache Airflow Job in Microsoft Fabric

Contents (8 segments)

Segment 1 (00:00 - 05:00)

Hi everyone, welcome back to our YouTube channel dedicated to Microsoft Fabric data engineering and data science topics. An essential part of playing with data and discovering the science embedded in it is data integration. That's why today we have a special guest, Abhishek, who is joining us from the data integration (Data Factory) team. Abhishek, thanks for joining us. It's the second time we have you here.

Yeah, thanks, Estera, for having me here. I enjoy answering questions from you and also demoing all the new capabilities that we are building, and that gets me excited. Thank you.

Awesome. Can you mention what you are working on? What are the features that you are covering as part of the product group at Microsoft?

Sure. I'm part of the Data Factory team in Microsoft Fabric, and of course our focus is to build the best data integration solution on Earth. With Fabric, I think these are becoming two of the best things together: making it SaaS and making it easier to use. I work on a few features within Data Factory, and one of them happens to be bringing Apache Airflow as a native item into Fabric so that customers can use it as a serverless offering. So we're going to talk about that today.

Yes, you perfectly revealed what the topic is: data workflows in Microsoft Fabric, so managed Apache Airflow jobs. Let's start with the intro, because we are going to talk about Apache Airflow in the context of a managed service as part of Microsoft Fabric. But can you give us a perspective on what Apache Airflow is? What's the value? Why should we dig into another technology, which is again beautifully integrated into our platform? What's the gain?

Yeah, absolutely, a quick overview of Apache Airflow. Apache Airflow is powered by Python, and as you know, Python is extremely famous; it's an extremely useful, great language that data scientists and data engineers love. So Apache Airflow is powered by Python, and you write DAGs, directed acyclic graphs, which are basically your Python code, the way of defining a data pipeline. That's what you do with Apache Airflow DAGs. Why it's famous in the community is because, of course, it's an Apache product and it's open source, so it's very well adopted by the community. At the same time, it gives you beautiful monitoring experiences for your DAGs. It can get extremely complex when you're building complex DAGs or complex data pipelines, which have a lot of workflows stitched together and a lot of conditions in them, like "on failure do this, on success do that." It gets more and more complex to monitor such workflows: what if this fails, what's the path of execution, and things like that. So it gives you that beautiful monitoring experience there. It's also code-based, so you can pretty much do everything that you can do with Python in terms of orchestrating data workflows. The key factor that influences everyone to use data workflows is extensibility. Especially if you're committing to a platform, you do want to understand its support and extensibility. Airflow is extremely extensible. Customers can create their own operators (I'll explain that in a bit; operators are basically functions that you use to do certain actions), you can customize executors, which is where you execute, and you can extend libraries or build your own. I'll show you in this example how we have created a library that can be leveraged by anyone using Airflow. So it's very extensible, and it's OSS and community driven, so there's a huge driver behind it. And you're bringing the best of both worlds together: you're bringing the OSS community, its support and extensibility into Fabric, and you bind Microsoft support and the OSS support together.

I love how you nailed that question. At the same time I want to ask: in the past, the two hearts of Data Factory were data pipelines and Dataflows Gen2. Right now we are getting a third heart. I assume that there are some overlaps with data pipelines, so I will ask: what are the key differences? When should I use data pipelines, and when Airflow?

That's a great question, I was hoping for it. One quick thing before I move on to that question is the extensibility part, which I wanted to highlight. These are all the providers that you can find for Airflow. It has a huge community as well as the ability to

Segment 2 (05:00 - 10:00)

connect to different data sources and different types, so these are some primary features of Airflow. Now, moving on to Estera's question, which is basically how we see Airflow fitting into Data Factory without customers being confused. You have data pipelines, which are very visual and very easy to use for data workflow orchestration. They are already available today in Data Factory, and customers can absolutely use them if they meet their requirements. Because it's visual, drag-and-drop authoring, it's extremely easy and quick to get started. Where Apache Airflow comes in is: think of it as a pro tool. If you need more extensibility or customization, these are capabilities that you get inherently in Apache Airflow, or data workflows, especially with the OSS extensibility that I showed on the previous slide and the growth it has due to OSS contributions. If you want those capabilities and you want to be future-proof in terms of extensibility and customization, that is when you would use Airflow. Of course you have to write code for this, you have to use Python, so a prerequisite is the skill of writing code to author pipelines. Otherwise you would use data pipelines, which are visual. This is how we see customers choosing between data pipelines and Airflow. Dataflows, on the other hand, is a visual, low-code transformation tool, and it is independent of Airflow. You can actually orchestrate a dataflows job using Apache Airflow. It's not an orchestration tool, it's a transformation tool, a visual one. So this is how the whole ecosystem within Data Factory sits as of now.

Abhishek, thanks. Now it makes sense that for orchestration and monitoring I will go with Airflow, just to simplify, and for data transformations, data pipelines in Data Factory still allow me to use more than 100 different components to build my pipeline, transfer the data, save it in a lakehouse, and take the data from any data source which is supported. I can't wait to see the demo, just to connect the dots with the value and how it looks, especially for the devs who are watching us.

Absolutely, I'm excited to demo. We introduced this feature during the Build conference this year, at the end of May, so it's in preview. You can find the blog; we'll put the link in the video. This is where you can find what Airflow is and what we are creating out of it. We certainly add a lot of capabilities beyond what is already available in Airflow, which adds more value for why you should choose Airflow in Fabric versus running your own Airflow. This blog talks about those, but I will run through each one of them. It talks about instant Apache Airflow runtime provisioning. It talks about the versatile cloud-based IDE that we give you, so you can author everything inside Fabric itself. It supports autoscaling. It supports intelligent auto-pause of clusters, which helps you save cost. It has enhanced security; it's Entra ID integrated, of course, so you don't have to pass in basic auth anywhere within Airflow. It supports all the Airflow plugins. And it gives you custom pools, which I'll show you. That gives you more flexibility if you want a dedicated cluster always running. There are scenarios in Airflow where you want some things to be running all the time, and you can do that by creating a custom pool and keeping that pool always on. So we have those customization capabilities as well.

Now let's get started. The blog has getting-started instructions as well. If you're watching this video after it's GA, you probably don't need to do the prerequisites. Since it's in preview at the time we're recording, you need to enable this capability at the tenant admin level so that you can try the feature out. This goes for any preview feature in Fabric, it's a standard procedure. You need to log into the admin portal, and of course you need to have admin privileges. Once you do that, you will be able to see data workflows and turn that capability on for your tenant. Once you're done with that, you will see something like what I see. I'm in a data workflow workspace here, and I'm going to click on Data workflow to create a new data workflow (Airflow) project. Let's click on this one, give it a quick name, and create it. What happens under the hood is that we are able to provision the cluster instantly. As you saw, one of the value props was that you can get clusters instantly. What you see now is an IDE where you can add DAGs in the Explorer view. You can also monitor Airflow and view the connections in Airflow. And you see a button here

Segment 3 (10:00 - 15:00)

which says Stop, which means that I got a cluster instantly, it is running, and it's going to be auto-paused if I don't use it for 20 minutes. That helps save a lot of cost, especially in your dev environments. Now I can click on this New button and write a simple hello world. You can see that you can create different kinds of files, because, for example, if you're orchestrating dbt in Airflow you might want to write SQL and a YAML file. You can create all those different kinds of files here using data workflows. Here I'll simply set up a hello world program. It has very simple code which does an echo of "hello world." You can change this to anything, but it's simply showing you how quickly you can get started with Airflow: a simple piece of Python code which prints hello world. These are some of the native experiences which you don't get in upstream Airflow. Airflow doesn't give you the authoring capability or the IDE capability, but this is all integrated into SaaS and Fabric, and hence you have this authoring capability. You can also debug it: you can click on Run/Debug, click Save and run, and see the status of it. Interestingly, you can also follow the status of these DAGs: once it's preparing to run, once it starts to queue the run, and once it starts running, you can see all the state information step by step in this UI. Now you can see the Bash operator over here, which printed that. What we have done additionally to make it more native is that your whole authoring and troubleshooting experience goes to the next level, because you can click on these tasks and it will take you to the actual Airflow monitoring if you want further details. For example, I can click on this task, it opens a new browser tab, and it shows me that specific task. So it gets me inside Airflow, to that hello world task, and I can see all the output and the log. XCom is basically the input and output that you pass to a given task. Here the return value is "hello world," and you can see it right here. That's the one that we were echoing and printing. The whole point is that this is the Airflow UI, the native Airflow UI, so we get the best of both worlds. We are not cutting off anything that native Airflow offers. Here's my DAG, which got created through the IDE in Fabric. I can see all the runs here and the information at the task level, and it's all very well hooked up from my authoring surface, which is in Fabric. This is the key feature that I wanted to show, which we have built on top of Airflow: we integrated Airflow with Fabric more natively, so you can author things in Fabric and it's all connected to the Airflow UI and monitoring systems as well. So that was a quick demo of a hello world program and how you can get started in seconds with instant cluster provisioning in Microsoft Fabric for Apache Airflow.

I love that you presented the hello world example; again, this is a starter. Can you show us something more advanced, some business case, for those who are watching us and looking to use Airflow for their problems and challenges?

Absolutely, and let me start connecting the dots. We talked about extensibility, so let me show you an example of extensibility as well. What we are doing now is creating custom plugins natively for Airflow that help us run, for example, a notebook or a data pipeline using the Airflow orchestrator. This plugin that we are writing is just a plugin which can be used anywhere, and it will be converted to a public provider which can be used by anyone in the future. It's in process; it will probably be a provider by the time you're watching this video. So this is extensibility, where we are extending the capabilities of Airflow to support custom operators, in this case the Fabric plugin which we are creating. It allows you to run a Fabric item, for example a notebook, a dataflows job, or a data pipeline. All these items can be executed using this custom plugin operator. So it's an extension. Here's the documentation, which we'll put in the link. Now let's see a more concrete, more real-world example of what people do with Airflow. We are using the same plugin that we just showed in our documentation, and we are importing the Fabric run item operator so that we can run these Fabric items. As you see, it has the name "ETL with data workflows," which is the name of the DAG object instance that you will see in the monitoring

Segment 4 (15:00 - 20:00)

and then it runs this particular DAG. If you look, it is representing a medallion architecture. I wouldn't really say that the code inside the notebook is what you would expect, but this is more from the data workflow or orchestration perspective. I'm doing a medallion architecture where I'm using a Fabric pipeline to load the data into the bronze layer. This is where I'm specifying the task ID, and I'm specifying the item ID, which is the pipeline GUID or identifier. I have also created a connection called Fab connection ID. Of course you need authentication while doing this, so as part of the plugin we have a generic connector, and that will be a Fabric connector in Airflow. Using that you will be able to orchestrate. We're doing it via SPN (service principal) at this point in time. So this is running a pipeline. Then we are representing the silver layer, which transforms the data from bronze using a notebook and loads it into silver tables, so we have a notebook representing that. Then finally we have the transformation in the gold layer, curating that data so that it can be leveraged by reporting. So we have the third task over here, which again runs a notebook. And then finally we are using a PBI dataset refresh, which is now called a semantic model. We are refreshing the semantic model so that my reports are refreshed as well once the whole medallion architecture data loading end-to-end process is done. So it's an interesting DAG, and then it sends an email as well, so it has a few HTML components in here. Now let's have a quick look at how this DAG looks and how you would troubleshoot it. As I showed you, you have this run capability here, so you can troubleshoot right from here by clicking on Run, which is awesome. For example, I click on this and it starts running this particular DAG. We'll give it a few seconds before it starts queuing this up, and then it starts populating each of these tasks. By the way, tasks are the smaller units of work within the DAG. A DAG is basically a single pipeline, and within that single pipeline you can have multiple tasks, which are the smaller items in it. So this is how you can run and monitor this, but let's quickly jump into the Airflow UI itself so that we can monitor this complex DAG. This is my "ETL with data workflows." First let's open the graph view to analyze what this is doing. The graph view lets me quickly analyze the code that I had written and the dependencies that I had created in a more graphical manner. It's amazing to be able to look at this, especially if you're troubleshooting things. You can see the different states of execution. The dark green ones are the ones which have executed. Especially if you're opening a particular run, you will see these color codes: greens are success, reds are failed, pink is skipped, and so on. So you see different statuses for different runs, which is awesome, and it keeps getting updated automatically. More importantly, this is what we were doing: we were ingesting data with a Fabric pipeline, so we are able to orchestrate a Fabric pipeline, and then we are calling a notebook, and again a notebook to load into gold, and then updating a semantic model. If you were to monitor this using Airflow with the custom plugin that we have created, it's extremely simple. Let me go back to the grid view. These are all my failed runs; recently it's actually great, because it's running well. I can click on any one of these. Let's monitor the pipeline: I click on this green dot here and it shows me the task. What we have done, and this is the beautiful extensibility of Airflow, is that it not only brings everything together, it also gives you the extensibility to jump in and monitor further details. For example, a pipeline has its own in-depth monitoring which I can use, and a notebook has its own notebook-level monitoring which I can use for that particular execution. It shows you the status here, which is great, but I can also click on Monitor item run, and when I do, it opens a new Fabric tab and shows me my pipeline run details and their status. It's great because you can click on this and find much more detail. In this case it was successful, but you can also find what you were able to move: I was able to load about 570 MB of data into my raw data. So you can find the audit trails, you can get into the details, and you can see what failed and what did not, all from a single source of truth, which is your orchestration pipeline. The same thing applies for other tasks, like for
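
The dependency chain described in this segment (pipeline to bronze, notebook to silver, notebook to gold, semantic model refresh, then email) can be sketched without Airflow at all. This is only an illustration of how a DAG's dependencies determine execution order, using Python's standard-library `graphlib`; the task names are hypothetical and this is not the Fabric plugin's API.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# Task names are illustrative, not taken from the demo's code.
DEPENDENCIES = {
    "load_bronze_with_pipeline": set(),
    "transform_silver_notebook": {"load_bronze_with_pipeline"},
    "curate_gold_notebook": {"transform_silver_notebook"},
    "refresh_semantic_model": {"curate_gold_notebook"},
    "send_email": {"refresh_semantic_model"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(" -> ".join(order))
```

In an Airflow DAG file the same ordering is usually declared with the `>>` operator between tasks, and the graph view discussed above renders exactly this chain.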

Segment 5 (20:00 - 25:00)

example for the notebook. I can click on Monitor item run and it now jumps into that specific notebook run, showing the error and the exception that you had, and you can troubleshoot those over here. The same thing applies once again for my semantic models, where I can click here and see Monitor Power BI dataset. It takes me into my dataset, or semantic model, and I can check the refresh history of the semantic model to see when it was last refreshed, which happens to be the current date. So this is how you can get into the details from Airflow. In this example we were talking about extensibility and how we extended Airflow with a native operator for Fabric so that you can run Fabric items and monitor them, all using Apache Airflow. So this was a more real-world demo of what customers build, like the famous medallion architecture pattern followed by refreshing Power BI reports. This is what you are seeing in this particular DAG.

That's phenomenal. This is embedded and integrated natively into Microsoft Fabric, so I want to ask what it was like pre-Fabric. For those who are watching and hearing about Airflow for the very first time, I think the value is clear. For those who are running on-prem or in the cloud: a few years ago, in another company, I was working on setting up Airflow on two VMs just to run it in high-availability mode. So what was it like to set up and get value out of Airflow before you brought this beautiful feature?

Great question. We started building this feature because of the pain points that we heard from customers in terms of creating Airflow, managing the security of Airflow, and managing updates of Airflow. There were tons of concerns and problems that customers had. Running open-source tools can be more challenging than running your own proprietary tools, because there are updates and security incidents, and then you need to roll out fixes and so on. It's a lot more challenging than when your code is completely owned by you. There are a few things that customers have to go through today if they are not using a managed version of Airflow. There are managed versions of Airflow that every cloud provider offers, but if you're using your own version of Airflow, the biggest challenge is getting started and setting it up, because Airflow is a runtime that needs distributed compute systems like AKS or Batch to be provisioned so that it can scale out. That's very challenging, and even when you do that, it takes probably 20 to 30 minutes to get started. So imagine your CI/CD, imagine your dev environments, imagine the hundreds of environments that you create. It kills a lot of time and makes you less productive, because you're spending more time doing things which could be automated or done quickly. So getting started is a very painful journey that we heard about from customers. I'll switch back to the announcement blog as well. That piece, instant Apache Airflow runtime provisioning, is one of the key areas that we focused on improving. Even if you pause this cluster and start it again, we strive to bring that down to a minute or two, whereas in the real world, if you're running your own, it might actually be a good amount of time, 15 to 20 minutes. For example, in ADF we supported this feature and it took about 20 minutes to provision clusters, and we brought that down to instant. There's a huge amount of engineering work that has happened to make this work this way. Apart from that, the IDE was one key piece which was missing. If you have your own standalone Airflow, you have to use your own code editor and you need a mechanism for deploying your code to your clusters. The IDE makes it super simple to get started. A few other engineering enhancements that we have made, which are not available in Airflow, are dynamic autoscaling based on need and intelligent auto-pause, because we know that no customer wants to pay for dev environments 24/7. We heard this from customers: especially when they create these environments by themselves, it takes so long that you would let one live forever rather than deleting it and creating a new one the next day. These were some of the challenges we heard from customers, so we made sure that we can pause and resume very quickly. Then there is integration with Entra ID. As we all know, when we work with customers, a concern with using OSS is at times that it's not well integrated into their security and audit systems. Entra ID provides all those capabilities: audit, IP-based allow lists, and things like that. We live just like an application under Entra ID, so it kind of

Segment 6 (25:00 - 30:00)

gives you that additional security layer in case you want to implement Entra security on top of what you're using. That's another key capability that gets added, thanks to the Entra team of course. Then there is support for plugins; you saw in the example how we could extend Airflow with a Fabric plugin so that we can run Fabric items. And then custom pools, which I didn't touch upon but want to. At times customers need to keep a cluster running 24/7. A good example would be your production cluster. You may have 100 dev environments, but you have a single or maybe a few production environments, and you want to make sure that they are up 24/7 because there are so many jobs being queued on them that they need to keep running all the time. This is where you can create custom pools. They give you flexibility and customization, and you can keep such pools running 24/7 to meet your needs. So you have both options available, the best of both worlds: you have a serverless offering which auto-shuts down, and you have a more Airflow-like offering, as you get today, where you can keep the cluster alive if you wish. That's what a custom pool allows you to do.

Fantastic. I want to ask about two more aspects before we unpack the topic of pricing, how it's billed to use managed Airflow. Just to have a full view: the service is managed and integrated within Fabric. Now, from a DevOps perspective, what about two aspects: integration with Git, and how CI/CD looks? Let's start with the baseline of CI/CD, which is Git integration.

That's a great question. Airflow is integrated with Git, and I'll show you that in a quick demo. For example, when I create a new data workflow, which I did here, I have the option of integrating it with Git. If you click on Settings and go to File storage, you will see an option called Git sync. Git-sync is a great capability that Airflow natively supports. It uses Git as the source of truth for the code files, so instead of storing the files locally, it's going to use this Git sync as the source of truth for the files. You can choose Git sync and use any of these providers. These are the top four providers that Airflow supports: GitHub, ADO, GitLab, and Bitbucket. These features come along with Airflow into our system, and we have integrated them here so that you can store your code anywhere. If you have existing code or existing investments in any of these repos, you can very easily point to them: choose the credential type, put in the repo, and specify the branch that you want us to use as the source of truth. Then you click on Apply. This gets you started with the existing Git investments that you have in terms of your Airflow DAGs, and you can get started in seconds. That's the Git integration. The same Git sync feature can be used for your CI/CD purposes as well, because it gives you the option of providing a branch. Imagine you were implementing CI/CD and wanted to have a dev branch in your repo, where your developers check in code and test it in your dev or QA environment, and then, once it's tested and approved, do a PR into another branch called the release branch. We have that documented in our ADF documentation, which we will bring into Fabric as well. At a high level, what you're seeing here is that you start with your dev Git branch, you check the code into it, and you Git sync that with your dev data workflow. Then you can test it, not locally on your machine but in the cloud: you can log in and test it in Fabric. Once you have done all the testing, you create a pull request to merge it into another branch, which in this case is the release branch. You can have another data workflow which is Git synced with the release branch of the same repo. This way you are able to iteratively develop your dev code and test it in the dev data workflow, and once you have tested and validated it, the moment you do a PR into your release branch, the release data workflow picks up all the changes from there. This is how your CI/CD story works with your DAGs.

That's super complex, and I see that again we addressed all the main

Segment 7 (30:00 - 35:00)

pillars. Now there is no reason to fight a battle with VMs and infrastructure as a service to set up Airflow when we have it embedded and simplified. One aspect that I believe lots of our customers will dig into is the pricing, because the reason for still using VMs or a Kubernetes cluster is pricing. Can you brief us on how to calculate the pricing for managed Airflow in Fabric, or what the pricing looks like? Oh, I see that you have slides, fantastic.

Yeah, absolutely. I know we can't go without pricing, but you saw all the value that you get: instant cluster provisioning, all the security features, and the features that we talked about, like the smartness in shutting down clusters when not in use. With all those things, this is what the pricing looks like. We have two different kinds of pools based on the workload. If it's a small or a dev workload, you would use the small pool type. If you have something in production, or heavy workloads or orchestration, you would use large. The prices are expressed in CUs. As you know, Fabric billing happens in capacity units (CUs), so these are unit costs and not dollar values, but you can translate them to dollar values based on the region that you are in. This is an hourly rate. For a small pool, for every hour of execution you pay 5 CU. For a large pool you are billed about 10 CU, so it's double, because you get double the compute. Then there are features like autoscale and additional nodes. In the initial pool that you create you get three nodes, because we have to support high availability, so you already get multiple nodes. But in cases where you're running, let's say, 100 DAGs concurrently and you need more scale-out, that is where you can use additional nodes, which you get by enabling autoscale and setting a maximum. The additional nodes that get added cost a much lower fraction of the pool cost. That makes sense, because orchestration of the pool is what we are doing when you create a pool, while adding a node is a lightweight thing. So it gives you much more value back when you're scaling out your nodes, because that's where your CU charges are lower. You're paying about 0.6 CU and 1.3 CU for small and large respectively for every additional node that you add. So when you enable autoscale, your pricing doesn't just get multiplied from the base rate; it adds a little bit more for each node that gets added, which makes it more valuable for customers.

Abhishek, thanks a lot for this conversation and this session. Apache Airflow jobs in Microsoft Fabric, data workflows as they are called right now, are super complex, and it looks like that is super well addressed. I want to ask you what's the future. What's the next step that you are planning for this functionality?

We are looking for customer feedback; we're more than happy to hear more of it. We have a few work items focused on network security, like making sure that, just like Spark, data workflows support managed VNet as well, so that you can connect to data sources behind private endpoints. Those are a few things that we're working on, and of course by general availability we'll have those capabilities in. Plus workspace identity, that's one work item we are working on. I do understand that we need an easier mechanism for customers so that they don't have to recreate connections, so workspace managed identity or user-assigned managed identity should be something which can be used by customers for easily connecting to different data sources. That's on our radar as well.

Fantastic. So for those who are watching us, again you got the tutorial end to end, fully packaged with the story, directly from the product group that created that functionality, data workflows in Microsoft Fabric. I have one more point. The demos you presented are super inspiring for those who want to start and try using Apache Airflow jobs. What would you recommend for them?

We have put all the tutorials that you saw today under the documentation. We'll put the link here so
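
The pricing arithmetic described in this segment can be sketched as a small calculator. The rates below are the CU-per-hour figures as quoted in the talk (5 and 10 for the base pool, about 0.6 and 1.3 per additional node); they describe the preview at recording time and may have changed, and the function and dictionary names are hypothetical.

```python
# CU-per-hour rates as quoted in the talk; treat them as assumptions
# and check the current Fabric pricing documentation before relying on them.
POOL_RATES = {
    "small": {"base": 5.0, "extra_node": 0.6},
    "large": {"base": 10.0, "extra_node": 1.3},
}


def cu_consumed(pool_type, hours, extra_nodes=0):
    """Estimate capacity units consumed by a pool over a number of hours.

    The base rate covers the initial pool; each additional node added by
    autoscale is charged at the lower per-node rate.
    """
    if hours < 0 or extra_nodes < 0:
        raise ValueError("hours and extra_nodes must be non-negative")
    rate = POOL_RATES[pool_type]
    return hours * (rate["base"] + extra_nodes * rate["extra_node"])


# A small dev pool used for one working day with no extra nodes.
print(f"small, 8 h, 0 extra nodes: {cu_consumed('small', 8):.1f} CU")
# A large production pool running around the clock with 2 extra nodes.
print(f"large, 24 h, 2 extra nodes: {cu_consumed('large', 24, 2):.1f} CU")
```

The second case shows the point made above: two extra nodes add 2.6 CU per hour on top of the base 10 CU, rather than multiplying the base rate.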

Segment 8 (35:00 - 36:00)

that you can get started at whichever complexity you want, whether that's hello world, running an ADF pipeline or a Databricks job, or running Fabric items like notebooks. We have documented all of them, including dbt. So we have full-fledged tutorials that you can use to accomplish the things that you want.

Fantastic, awesome. Thanks a lot for joining us for the second time. For those who are watching us, remember to hit the like button, leave a comment, leave feedback, and reach out to us directly with suggestions for the next episodes and what you would like to see. Also leave feedback about managed Apache Airflow jobs in Microsoft Fabric; as Abhishek mentioned, we are looking to get your feedback as well. Remember about one website, ideas.fabric.microsoft.com, which we monitor on a daily basis. If you have any idea or feature request that you would like to get implemented to improve Fabric, please leave that idea there. So until next time, happy exploring and using managed Apache Airflow jobs in Fabric, and see you again soon. Thanks.
