# Lifecycle of Apache Spark Runtimes in MS Fabric: Unpacking Experimental Preview, GA & End of Support

## Metadata

- **Channel:** Azure Synapse Analytics
- **YouTube:** https://www.youtube.com/watch?v=1nlqp5Dv6ko
- **Source:** https://ekstraktznaniy.ru/video/44746

## Transcript

### Segment 1 (00:00 - 05:00)

Hello everyone, welcome! This is a new episode of Fabric Espresso, and today a new guest is joining us directly from the Fabric product engineering group to talk about the newest innovation we shipped: Fabric Runtime 1.3, which is based on Apache Spark 3.5. Since this is the first time you are joining us, could you please take a moment to introduce yourself and tell us what you are working on and which features you are delivering and building with your team?

Thank you, Estera, for having me here on Espresso. I'm glad to share my point of view and my experience of what I have developed recently in Microsoft Fabric. I engaged with the engineering team here to develop multiple-runtime support, where users can easily switch runtimes and experience the strengths of whichever runtime they want. On top of that, I have been releasing new runtimes; most recently I worked on releasing the Spark 3.5 runtime, which has been a great success, and I'll talk about that today as well.

Awesome, so let's proceed directly to this innovation. In Fabric we ship the runtime, which is the fundamental element of all data engineering, data ingestion, and data science workloads. The runtime is based on Apache Spark. In the past we released Runtime 1.0, then 1.1 and 1.2, and now it's the turn of 1.3, based on Apache Spark 3.5. Can you tell us more about the ingredients, the components, of this runtime?

If I talk about it collectively, as a whole, there are a lot of components we build as part of the runtime. In very generic terms it covers AI, data engineering, and of course data science and machine learning, which are different things. It's huge; you can explore it in many directions, and Microsoft Fabric provides all of these capabilities seamlessly. If I talk specifically about Fabric Runtime 1.3, we are delivering Spark 3.5 and Delta 3.1 in it, so users will have a lot more features to explore. We are also providing a complete data science library as part of the runtime itself; users can write their code in any language they want, be it Python, Java, or Scala, and explore the strengths of the product.

Awesome. I want to add a comment just to make it clear that we are building on open source: the "Apache" in Apache Spark indicates that Spark is an open-source project run by the Apache Software Foundation. As a product, we take the new open-source release, Apache Spark 3.5, and then add a lot of components on top; I think our documentation lists close to a thousand different libraries in every new runtime version, and we make sure all of them are compatible. In the end we are baking this runtime and releasing it in multiple phases. For Runtime 1.3 we released an experimental public preview, then the full-scope public preview, and we are on the way to naming it GA, ready for production workloads. As you drove the experimental public preview of this runtime, and it was the very first time we decided to limit the scope and deliver a new runtime version with the newest Spark, the newest Delta, and the newest Python as soon as possible, can you tell us more about what the scope was and which decisions led to releasing it that way?

We divided this runtime into multiple stages: experimental, then public preview, and we are going to be GA pretty soon. The scope of experimental was to give users a basic prototype so they could get right onto the latest technology available on the market and start using it, and that is Spark 3.5, plus the latest Delta available, which was 3.1. You will love to hear that our customers truly loved it; they jumped straight onto the experimental runtime and started exploring it.

### Segment 2 (05:00 - 10:00)

They were amazed to see the results. We kept adding a lot more new libraries for data science, and a lot more new functions, and our customers truly loved the complete scope we delivered for experimental. We also had enhanced security: in experimental we take extra caution that security comes first and privacy comes first, and all of that has to be delivered right from stage one of the Fabric runtime. Keeping that in mind, we deliver the best to our customers.

That makes sense. So can you show us how multiple-runtime support works and how to change the runtime?

Sure. This is how our Microsoft Fabric data engineering UI looks, and I have just created a workspace for our demo today. There are two ways to switch runtimes, and it's very easy and very simple, integrated right into the product. Let's say you want to update the Spark runtime at the workspace level, so that all the artifacts, whether a notebook, a Spark job definition, or any other job, use the same runtime. In that case I suggest switching the runtime through the workspace settings. You open the workspace, go to Workspace settings, scroll down to the Data Engineering/Science section, and there, under the Spark settings, you will find the Environment tab. In that tab there is a drop-down named Spark runtime version, where you will see all the runtimes available in the product: 1.1 with Spark 3.3, 1.2 with Spark 3.4, and 1.3 with Spark 3.5. Let's say I'm currently using 1.3, the latest one available, which is in public preview. Let me switch to 1.2, because, say, I want to go to a GA runtime. I just choose it, it shows some useful information here, including a link to the runtime documentation so you can understand it better, and I click Save. It takes just a few seconds, see how quick it is, and you are already all set to use Runtime 1.2.

Now let's validate that in a notebook. I'll open my notebook and check which runtime it sees. I click the drop-down at the top; it says that at the workspace level we have Runtime 1.2 selected, so I select that and start using it. Very easy, very simple. It's starting a new session in the notebook, which takes very little time, just a few seconds, and then it will print the runtime we are using. And there you go: the session started in three seconds, and you have the Spark runtime printed here, 3.4, which is what we chose; Fabric 1.2 uses Spark 3.4.

Can you demo how to switch to the newest runtime, 1.3, one more time?

All right, sure. Let's go back to the workspace settings and choose the most recent one: from the Data Engineering/Science section, go back to the Spark settings, open the Environment tab, and choose the most recent runtime available, 1.3. You can check the documentation if you need more information, and click Save. Once it's saved, see how fast it is, a matter of a few seconds, not even that. Let's go back to our notebook and quickly test it. If you validate which runtime is currently set at the workspace level, it's Runtime 1.3, public preview.
Right, and this is Spark 3.5 and Delta 3.1. Let me validate that. Let me start the session in my notebook; this code will print the Spark version I'm using to run this notebook. And see, the session started quickly, in just four seconds, and there you go: the Spark version is 3.5, which is the version used in our Fabric Runtime 1.3.

That's awesome. So we can change, we can upgrade, and if needed we can always downgrade, without any problems. The rule of thumb is that for production workloads we should always use the latest GA runtime version, which at this point is Runtime 1.2. Soon we will name Runtime 1.3 GA; then it will be ready for production workloads with full SLAs. We are following the same rules the whole of Azure follows.
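
The validation cell itself isn't reproduced in the transcript. A minimal equivalent, assuming a Fabric notebook where the `spark` session object is preconfigured, would be:

```python
# Print the Spark version of the current session to confirm the runtime:
# expect 3.5.x on Fabric Runtime 1.3 and 3.4.x on Runtime 1.2.
print(spark.version)

# The same check via SQL; version() returns the Spark version and build hash.
spark.sql("SELECT version()").show(truncate=False)
```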

### Segment 3 (10:00 - 15:00)

Now, can you tell us more about why we should use Runtime 1.3? What are the features that make it worth switching to that version?

Our Runtime 1.3 is massive; it provides a lot of improvements over Runtime 1.2, and if I started listing them all it would be a huge list, so I'll walk through the most important ones.

A few important ones are the new SQL functions that have been added. I found those amazing because I love financial numbers, I love numbers in general, and I always wanted to print numbers in a specific format: with specific digit separators, say commas or decimals, or maybe as a currency, whatever format you want. Until now there was no straightforward way of doing it. In Spark 3.5 they have added this feature for financial analysts: you can simply use to_varchar (an alias of to_char), provide the format, and print the numbers the way you want. It works seamlessly and beautifully for all numbers, even very big ones. That's one of the great features recently introduced.

Another one is HLL sketching, the HyperLogLog functionality that has been added. If you are analyzing any data, you always need to see how many unique values there are in it, right? Last year I was doing exactly that kind of analysis, and I was surprised that I couldn't find any such function in Spark, but this year they have added it, with very good performance, and it runs really fast. In the demo you see how simple it is: you can put in values, or pull data from your tables, create a sketch, run the estimate on top of it, and it tells you how many unique values you have.

That's awesome, because I can pass not only values, as an array or some form of array, but also a column or a table, so the entire data set.

Exactly, and on top of that it's a fully distributed function, so it runs very quickly whatever the size of the data is; it uses the power of Spark end to end, so it's super amazing. Then I see they have also added the IDENTIFIER clause. What is the IDENTIFIER clause? Let's say I write SELECT * FROM a table name; attackers started abusing these identifiers, the table names and the column names, to hack into databases. So Spark came up with the IDENTIFIER clause, which protects you from SQL injection attacks, and now you no longer have to worry about SQL injection via identifiers in your product. That's another beauty that has been added.
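
As a reference for the three features just described, here is a hedged sketch of what using them can look like in a Spark 3.5 session (for example a Fabric notebook on Runtime 1.3, where `spark` is preconfigured); the table and column names are illustrative assumptions:

```python
# 1) to_char / to_varchar: format numbers with digit grouping,
#    e.g. for financial reporting.
spark.sql("SELECT to_char(78145231.45, '99,999,999.99') AS amount").show()
# -> 78,145,231.45

# 2) HyperLogLog sketches: build a sketch, then estimate distinct values.
spark.sql("""
    SELECT hll_sketch_estimate(hll_sketch_agg(col)) AS approx_distinct
    FROM VALUES (1), (1), (2), (2), (3) AS tab(col)
""").show()
# -> 3

# 3) IDENTIFIER clause: bind a table name as a query parameter instead of
#    concatenating strings, which closes off SQL injection via identifiers.
spark.range(3).createOrReplaceTempView("sales_demo")  # hypothetical demo view
spark.sql("SELECT * FROM IDENTIFIER(:tbl)", args={"tbl": "sales_demo"}).show()
```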
I want to ask about user-defined table functions. Before the experimental release I was going through the Spark 3.5 release notes, the open-source release notes. And here one more point: we ship Apache Spark with all of its strengths, plus our own native optimizations; by native I mean integration with ADLS Gen2 and with the entire Fabric ecosystem, similar to what happened in the past for Synapse. When I was going through those release notes, I found user-defined table functions the most appealing, because it looked like I could apply a function to an entire table at once. Can you tell us more about how they work?

Yes, Spark 3.5 gives you exactly that opportunity, so let me explain. There are two things: one is UDFs and the other is UDTFs. A UDF is something that returns a scalar value, a single value; a UDTF, in contrast, returns a complete data set, a complete table in itself, and that's the beauty of it. UDFs have always been there, but there has always been demand for UDTFs, because we often need to run some operations as a whole, in a single unit, and return a result set to the user for further analysis.

Here is a simple example I put together to explain exactly how this is coded and how it works; let me scroll up a little. I have added one class, in Python, with an eval function; that is the signature for how we define it, and the yield denotes the row you are creating that will be returned. It could be a SELECT * from some table as well, but I'm keeping the example simple here. So what I'm returning is a row with two columns: a value for column one and a value for column two.
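
For reference, here is a minimal sketch of the kind of Python UDTF being described in this demo and walked through after the segment break; it assumes a Fabric notebook session on Runtime 1.3 (Spark 3.5), and the names MyUDTF/my_udtf are illustrative, not taken from the notebook shown on screen:

```python
from pyspark.sql.functions import lit, udtf

# returnType declares the shape of the table the UDTF emits:
# two string columns, c1 and c2.
@udtf(returnType="c1: string, c2: string")
class MyUDTF:
    def eval(self, x: int):
        # Each yield emits one output row; a UDTF may yield many rows per call.
        yield f"value{x}", f"value{x + 1}"

# DataFrame API: calling the UDTF returns a table-shaped result.
MyUDTF(lit(1)).show()
# +------+------+
# |    c1|    c2|
# +------+------+
# |value1|value2|
# +------+------+

# SQL API: register the UDTF, then invoke it in the FROM clause.
spark.udtf.register("my_udtf", MyUDTF)
spark.sql("SELECT * FROM my_udtf(1)").show()
```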

### Segment 4 (15:00 - 16:00)

Right, and then I'm converting this class into a UDTF: I'm creating it as a UDTF, putting the class name in, and defining the returned result-set type, which has two columns of string type, c1 and c2. Then, when you call .show() on my UDTF, it prints the output like a data set; just as we call .show() on a DataFrame, we call .show() on the UDTF, and you see two columns in a tabular format in the output. That's the beauty of it. And you don't need to use it only from PySpark; you can use the same thing in SQL as well. As you see here, I'm registering the UDTF like this, and then using it in a spark.sql command, writing SELECT * FROM my UDTF function, and it prints the complete result set. See how easy it is to code and use, and it's seamlessly available.

Thanks a lot for doing the demo. We will dig into the details of Apache Spark in upcoming episodes. For those who are watching us: remember to leave a question, leave a comment, and hit the like button. Until next time, happy exploring Runtime 1.3 and the latest features coming directly from Apache Spark 3.5 and Delta Lake 3.1. Thanks a lot!
