# Spark in Microsoft Fabric and Spark SQL for DW Users

## Metadata

- **Channel:** Azure Synapse Analytics
- **YouTube:** https://www.youtube.com/watch?v=qzZvGfx16zM

## Contents

### [0:00](https://www.youtube.com/watch?v=qzZvGfx16zM) Segment 1 (00:00 - 05:00)

Hey everyone, welcome to our channel dedicated to Microsoft Fabric. This is a series about data ingestion, data integration, data engineering, and data science within Microsoft Fabric. Today, for the very first time, Arshad is joining us to discuss the topics that he is driving within our product group at Microsoft. Thanks for joining. Could you please take a moment and introduce yourself?

Thank you so much for having me here, Estera. My name is Arshad Ali, I'm a Principal Product Manager at Microsoft, part of the Microsoft Fabric team. I focus on the Spark runtime area. So yeah, that's all about me. Let's get it started.

Awesome. We are going to talk about Spark for DW users, for those people who are maybe not familiar with Apache Spark but can code with SQL. SQL is super popular; in almost all high schools and universities we are forced, or encouraged, to learn SQL. So let's start with the first question: can you please introduce Apache Spark for those who are not familiar with it?

Yeah, of course. But before that, let's set a bit more context for the talk. This talk is going to be focused on the SQL developer as a persona, and how I can leverage the SQL skills I have built over a period of time to do modern and big data analytics with Spark. Having said that, let's get into more details. Apache Spark is a fast and powerful open-source computing framework for large-scale big data analytics. With its unified set of APIs it can handle various types of workloads, such as structured, unstructured, batch, streaming, machine learning, or even graph data. It also supports multiple programming languages to choose from, such as Scala, Java, Python, or even SQL, which we are going to focus on in this talk. Spark's features let you perform complex analytics, including machine learning and artificial intelligence tasks, on big data sets.

The architecture of Spark is based on a distributed master-and-worker-nodes pattern in a cluster of nodes. When you submit your job to Spark, the job gets submitted to the master, where the driver program runs the user's main function and executes various parallel operations on different worker nodes in the cluster. It means Spark spreads the processing of your query across all these worker nodes, on which the actual processing happens, depending on how big your cluster is. They pull the data from the storage location, work on that data, and once they have the result ready for the chunk of data they are responsible for, they send that result back to the driver node, and the driver node returns it to the user. This way the whole thing is parallelized across different worker nodes, and it's all handled by the software. So we, as developers or SQL developers, don't have to worry; we just submit the job and the Spark engine takes care of distributing it across multiple executors and running it in parallel.

The Catalyst Optimizer is the query optimization engine, which is very similar to the query optimizer you are familiar with from traditional database systems. The Catalyst Optimizer optimizes structured queries expressed in SQL, or via the DataFrame or Dataset API, which can reduce the runtime of programs and save cost. Developers often treat the Catalyst Optimizer as a black box that magically just works. It combines both rule-based and cost-based optimization techniques, and it applies various logical optimizations to simplify the query plan, such as predicate pushdown, constant folding, and column pruning. By leveraging the Catalyst Optimizer, Spark SQL generates a more efficient execution plan for your queries. This optimized plan significantly improves the performance of Spark SQL queries, and we as developers can focus on writing the business logic expressed in SQL instead of spending too much time optimizing the query; the Catalyst Optimizer takes care of that part.

Arshad, that's a great fundamental layer. Just to give our viewers an understanding of Apache Spark, could you give us a context of Apache Spark on Microsoft Fabric? What do we have here?

Of course. Now that we have a fair understanding of what Apache Spark is and the use cases it's used for, it will
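The Catalyst Optimizer behavior described above can be inspected directly from Spark SQL. A minimal sketch (the table and column names here are hypothetical, not from the demo):

```sql
-- Hypothetical example: ask Spark SQL for the plan Catalyst produces.
-- EXPLAIN EXTENDED prints the parsed, analyzed, optimized, and physical
-- plans, where effects like predicate pushdown and column pruning show up.
EXPLAIN EXTENDED
SELECT o.customer_id, SUM(o.amount) AS total_amount
FROM orders o
WHERE o.order_date >= '2024-01-01'   -- candidate for predicate pushdown
GROUP BY o.customer_id;              -- only referenced columns are read (column pruning)
```

Comparing the "Analyzed Logical Plan" with the "Optimized Logical Plan" in the output is a quick way to see which rewrites Catalyst applied to a given query.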

### [5:00](https://www.youtube.com/watch?v=qzZvGfx16zM&t=300s) Segment 2 (05:00 - 10:00)

be interesting to know that we, as part of the Microsoft Fabric team, have taken the same open-source Apache Spark and made it enterprise-ready by investing in areas like security, performance, high availability, management of infrastructure, networking, and so on. If you have an existing Spark application, you can migrate it to Fabric Spark quickly and easily with few or minimal changes, especially when you are migrating to the same Spark version. Spark in Fabric powers both the Data Engineering and Data Science experiences. You use the Data Engineering experience for data integration and data transformation; you use the Data Science experience for exploratory data analysis, feature engineering, and building and training AI models. Spark in Microsoft Fabric offers world-class serverless compute: your Spark session gets started in a few seconds, and it dynamically scales up or scales down based on the workload that you are executing. Given that Microsoft Fabric is based on a SaaS model, Spark in Microsoft Fabric is fully managed. That means we as users or developers don't need to create or manage compute infrastructure when we are working with Spark in Microsoft Fabric. It provides the flexibility to write our code in a notebook, which supports multiple programming languages; all the programming languages supported on Spark are supported here as well. It offers an immersive authoring experience through capabilities like co-editing and co-authoring in the notebook, which we'll see in some examples. It integrates with the Lakehouse, where you store, manage, and analyze all your data of all types in one single location. And once you are done with your development, or you have created a notebook with all your Spark code, you can orchestrate it with a pipeline to run on a scheduled basis. So these are the capabilities that are available in Microsoft Fabric for Spark.

Arshad, that's great. Now we have a context of what Apache Spark is on Microsoft Fabric specifically. The next step I would like to take, because again we want to talk about Spark for SQL DW and T-SQL users: could you give us an overview of Spark SQL versus T-SQL in the context of Microsoft Fabric Spark?

Yes, of course. Spark SQL follows SQL dialects pretty much for all the operations that we have been doing with SQL, and we can use them without compromising on performance, thanks to the Catalyst Optimizer that we just talked about. So let's get into some real examples and let me show you how it works. This is the notebook interface in Microsoft Fabric. A notebook has multiple cells, and for the entire notebook you can specify a default language for all the cells. In this case I'm going to use Spark SQL, and at each cell level we have the ability to switch to any other supported language. I can do that, and I'll show you some examples. But before we go further in this notebook, let me kick it off for execution, and while it's executing I'll explain the different parts of it. We can execute each individual cell at the cell level, or we can kick off the whole notebook by clicking "Run all" here, which is what I just did. As you can see, my session started in a few seconds and my code is executing.

So let's talk about some of the use cases that we deal with as SQL developers. The very first use case: I want to create a table and execute some DML operations. I have this CREATE TABLE statement from the SQL world; I want to execute it here and see whether it works. I went ahead and executed it, and as you can see, my command has executed and the table has been created. I can validate it by looking at the schema of this table. For that we have the Spark command DESCRIBE on the orders table, which gives me information about all the columns of this table, their data types, and any comments
I have specified. In this case, notice a couple of things: I had this column defined with the VARCHAR data type on the SQL side, and the equivalent on the Spark side is STRING. This has been taken care of internally, switching from VARCHAR to STRING, as you can see. When I created this table, I didn't specify what type of table was going to be created. In Fabric, every table that we create is created by default as a Delta
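The create-and-describe flow just described might look roughly like this in a Fabric notebook Spark SQL cell (the column names are assumptions for illustration; the demo's exact schema is not shown in the transcript):

```sql
-- Illustrative sketch of the demo's first use case.
-- Column names are assumed, not the demo's exact schema.
CREATE TABLE orders (
    sale_id    INT,
    customer   VARCHAR(100),  -- Spark stores this as STRING internally
    amount     DECIMAL(10, 2),
    order_date DATE
);

-- Validate the schema; the VARCHAR column is reported as string.
DESCRIBE orders;

-- In Fabric the table is created as a Delta table by default;
-- DESCRIBE DETAIL shows the format, location, file count, and so on.
DESCRIBE DETAIL orders;
```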

### [10:00](https://www.youtube.com/watch?v=qzZvGfx16zM&t=600s) Segment 3 (10:00 - 15:00)

table, and you can validate it by refreshing the Lakehouse view here. Let me do that. My table has been created here, and if you see this tiny triangle icon, the Delta icon, it indicates this is a table of type Delta. I can expand it and see all the columns of this table with their different data types. If I prefer code, I can look at the details of this table by using the command DESCRIBE DETAIL on the orders table. This tells me when the table was created, when it was last modified, the Delta format, the number of files, whether I have partitions on the table, and all those details.

Now, as a next step: the table has been created, and by default it has been created as a Delta table. Let's insert some records into it. Again, I have the same command that I used on the SQL side; I brought it here and executed it, and you see the command executed successfully. If I select from this table, I should be able to see all these records reflected, or stored, in the table. All right, so far so good. Now let's update some records. In this case I'm updating one record, and the next step is inserting another record into this table, and this all gets applied to the table. When I query the table, I can see my record has been updated and a new record has been inserted. So this is a very simple use case where we create a table and run some DML operations on it. It's as simple as taking the code from the SQL side, bringing it here, and running it on the Spark side with Spark SQL; it all works fine without any change, or if there is a change, it's very minimal. In this example I didn't make any change; I brought the code as-is and it's all running fine. So this is use case number one.

Use case two: suppose we are building an ETL pipeline. We have one-time data that we have loaded into our destination table, and we want to build this ETL pipeline to incrementally bring the data from the source and merge it into the destination table, or into the dimensional model that I have created. Let's simulate that scenario. Say I have this stage table with three records in it. This one record, with sale ID 11, has been updated in my source; these are the two records that have been inserted into my source. So this information is coming into my staging table, and I can use the staging table as a source to update my target. I'm saying MERGE INTO the target table USING the staging table, join ON this column, UPDATE if there is a match on the joining key, and if not, INSERT that record. So I have this very simple MERGE statement, fully supported here, which you can take advantage of. When I execute it, it tells me that three rows were affected by the MERGE statement, which is expected because we have three rows in our source table, and it tells us that one row was updated and two rows were inserted into the destination table. We can validate again by running a SELECT statement on that table; you can see this record has been updated and two new ones have been inserted. So far so good.

Now, Delta Lake maintains the history of changes that we make to a table, so we can look at all the history by executing the command DESCRIBE HISTORY with the name of the table. What we see here is a list of all the operations we have performed: we started by creating the table, then we inserted records, we updated some records, we inserted more records, and finally we merged records. We can go into the details of each of these operations and find more information about the table. But let's say I have this
history of changes on this table. If I look at the current state of the table, it has 13 rows, but what I want is to time travel and see all the records as they were before the merge statement was executed, which was version four. So I want to see whatever was active at version three. For that, I execute this command where I say SELECT * FROM this table where
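The incremental-load pattern just described, together with the history and time-travel commands, can be sketched as follows (table and column names are assumptions, not the demo's exact objects):

```sql
-- Sketch of the demo's incremental merge from a staging table.
MERGE INTO orders AS tgt
USING orders_staging AS src
    ON tgt.sale_id = src.sale_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Delta Lake records every operation on the table as a version.
DESCRIBE HISTORY orders;

-- Time travel: query the table as it was before the merge (version 3).
SELECT * FROM orders VERSION AS OF 3;
```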

### [15:00](https://www.youtube.com/watch?v=qzZvGfx16zM&t=900s) Segment 4 (15:00 - 20:00)

version as of three. In this case it gives me the version before the merge statement, which is what I see here; Delta tracks all these details.

Now, the third use case. What we have seen so far is one table, but oftentimes we need to join this table with other tables in the Lakehouse and run some aggregations, or create some aggregates. This is where we can take advantage of all the capabilities that Spark provides with Spark SQL, and this is what you see as an example. In this case I have a fact table; I have the dimensional model here, with one fact table and multiple dimension tables. I'm joining the fact table with the dimension tables, doing some aggregation by doing GROUP BY on specific columns, and running this query, and you'll see all this data coming in as if you were running the SQL statement in any SQL database engine. Not only that: say I want to materialize this information into a table so that I don't have to run this query again and again. I can run this CREATE TABLE AS SELECT statement, or in short CTAS, where we say CREATE TABLE, the name of the table we want to create to materialize this information, then AS, and then the SELECT statement. When we run this, a table gets created with all these aggregates in it. So if I come to my Lakehouse view and refresh, because the table has been created, I should be able to see the table here.

Arshad, you have shown that there is a similarity in how VARCHAR versus STRING behaves. Can you share any other key differences or key similarities between Spark SQL and T-SQL? This fundamental looks encouraging.

Yeah, so there are very minor differences. Spark SQL follows SQL dialects, and gradually it's moving towards supporting ANSI SQL, so all the gaps or differences that you see will gradually be taken care of. One of the differences, which is minor in my opinion, is this: when you run a SQL statement to get only a few top rows, you say SELECT TOP 10 * FROM that table, or whichever columns from that table. In Spark SQL, that TOP is not there; instead you say SELECT * FROM that table and LIMIT to the specific number of rows you want returned. So those are very minor changes you have to make, but pretty much all the SQL code that you have written will work here. There might be some differences in the functions that are provided out of the box. Spark, like other database systems, provides hundreds of different native functions in different categories: several functions in the string category, aggregate categories for creating aggregates, several functions for mathematical operations. There might be some differences in how a function is named, but based on my experience, if I have SQL code I can bring it here and run it as-is, or if I have to make changes, they are very minimal.

Makes sense. What about PySpark? In Apache Spark we have multiple languages, and Spark SQL is one of them. Is there any difference between Spark SQL and PySpark? And can we embed SQL code from PySpark, for example when we are more familiar with it or we need to use some libraries? How does that work?

Yes, of course you can. First of all, let's start with performance: is there any difference whether I'm using PySpark or Spark SQL? Not really, because irrespective of which one I'm using, before execution happens, the optimization goes through the Catalyst Optimizer. The Catalyst Optimizer ultimately generates lower-level code which gets executed, and it follows the same optimization rules. So it doesn't matter if I have a query written in PySpark or the same query written in Spark SQL; the performance is going to be pretty much the same. That's number one. Number two: yes, you can mix and match. As
we were discussing earlier, there are hundreds of different native functions that we can use immediately, but that's not all. Spark is very extensible in nature, and you can extend the capability of Spark by bringing in other libraries from outside or from your organization. For example, there are hundreds or even thousands of libraries available across PyPI, Conda, and
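The TOP-versus-LIMIT difference and the CTAS pattern described above can be sketched like this (the fact/dimension names are hypothetical):

```sql
-- T-SQL:      SELECT TOP 10 * FROM orders;
-- Spark SQL has no TOP; use LIMIT instead:
SELECT * FROM orders LIMIT 10;

-- Materializing an aggregation with CTAS, as in the demo
-- (table and column names are illustrative assumptions):
CREATE TABLE sales_by_region AS
SELECT d.region, SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_customer d ON f.customer_id = d.customer_id
GROUP BY d.region;
```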

### [20:00](https://www.youtube.com/watch?v=qzZvGfx16zM&t=1200s) Segment 5 (20:00 - 25:00)

Maven. You can bring those libraries into Spark and use them across the different supported languages. With that, let me share my screen and show you an example of how we can extend the native capability of Spark by bringing in a library from open source.

Let's say I have this table, and it contains reviews of a product. What I want to do is see what the sentiment of each of these reviews is. I know that there is no native function to give me a sentiment score, but I know there is an open-source library called NLTK, the Natural Language Toolkit: an open-source Python library which provides the capability of cleansing the data by removing the stop words and then generating a sentiment score. So how can I bring that Python library and use it in my SQL, or Spark SQL? This is what I'm going to demonstrate. This is my table with the reviews, and in order to generate the sentiment score, I'm going to import the NLTK library; this is sample code that I got from the NLTK documentation and put here. I import it in my session, and then I create two functions. One removes all the stop words from the review, so that the score is generated on the actual comment; that's one function. The second function generates the sentiment score, whether it is positive, negative, or neutral, combining them all together and returning them as one column. This is the Python code that you will find as part of the NLTK documentation; I have brought it here, included it in my notebook, and created these Python functions.

But that's not all. I don't need to limit myself to using it in Python only. I can register it as a user-defined function in Spark and use it across all the supported languages. In this case I'm using spark.udf (UDF stands for user-defined function) and registering it here so that it's available across all the supported languages. Once I register it, I can call this function as if it were a native function in Spark. So I'm calling this get-sentiment UDF, passing the review column as a parameter, and getting the result back. What we see here: this is my original data, this is one sample review, and the clean review column is where all the stop words have been removed. The third column returned is based on the sentiment score: whether it's positive, negative, or neutral, and what the percentage of each is. You can look at other examples; for instance, this is such a big review, you can click on it and see again that the stop words have been removed and the scores coming in. So this is how you take advantage of extending the native capabilities of Spark by bringing in a library from a different world.

All right, that was one example where I used it with one of my tables, but that's not all. Once I have the function registered, I can use it with any table that has that kind of information. I have another table with reviews, a different type of review, and I can run the same SQL statement by calling those functions and passing the review, and it will generate the same sentiment score for me. So this is how I can extend the capability.

Arshad, great demo of the sentiment analysis. Now I would like to ask you: do I always have to run Spark SQL, or is there a way, as a dev, to combine the two, Spark SQL and T-SQL?

Yes, that's a very good question. Of course. In Fabric, I as a data developer or data engineer prefer to work in Spark, because I work on large volumes of data and want my data processing to happen across multiple nodes; that's where I use Spark to
take advantage of distributed computing. Once I am done preparing my data, I can expose it, or share it with my business users, through the SQL analytics endpoint of the Lakehouse. That allows business users to directly run queries using the SQL interface that the Lakehouse provides in Fabric. So let me show you an example of
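The UDF pattern from this demo can be sketched as below. This is a simplified stand-in: the actual demo uses NLTK's stop-word list and its sentiment analyzer, while the tiny hard-coded word lists here are assumptions that keep the sketch self-contained. The registration calls at the end are the Spark part and only run inside a Fabric notebook session.

```python
# Simplified stand-in for the demo's two functions. The real demo uses
# nltk.corpus.stopwords and NLTK's sentiment analyzer; the small word
# lists below are assumptions so the sketch runs without NLTK installed.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "it", "this", "and", "of"}
POSITIVE = {"great", "good", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def remove_stopwords(review: str) -> str:
    """Drop stop words so scoring runs on the meaningful tokens."""
    return " ".join(w for w in review.split() if w.lower() not in STOP_WORDS)

def get_sentiment(review: str) -> str:
    """Return a coarse positive/negative/neutral label from word lists."""
    words = {w.lower() for w in review.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Inside a Fabric notebook you would register these so they become callable
# from Spark SQL across all supported languages, e.g.:
#   spark.udf.register("clean_review_udf", remove_stopwords)
#   spark.udf.register("get_sentiment_udf", get_sentiment)
#   spark.sql("SELECT get_sentiment_udf(review) FROM product_reviews")
```

Once registered, the functions behave like native Spark SQL functions against any table with a review column.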

### [25:00](https://www.youtube.com/watch?v=qzZvGfx16zM&t=1500s) Segment 6 (25:00 - 30:00)

that. Here we are in the Lakehouse view, and this is the Lakehouse that we have; all the tables from the dimensional model I created earlier, plus a couple of other tables I created as part of my previous demo, are all here. Now, this is all good when I'm running it within a Spark context, or running a Spark job on it, but I want to expose it to end users. To do that, we can switch to the SQL analytics endpoint; that's one way, and I'll show you another way too. If we switch to the SQL analytics endpoint, we come to this interface where we execute T-SQL queries, and those queries get executed by the SQL engine. In this case, let's say I want to create a new SQL query file and query from, say, the fact sales table. I drag and drop it here, I say SELECT COUNT(*) FROM this table, and run it. At this point the query is executed on the SQL engine, and you see the query has executed; this table has 50 million plus rows, and that's the response we get. Again, this provides complete freedom in how you write your SQL statements; you have pretty much all the standard functions and syntax supported here for writing T-SQL. Let me show you one example I created here: this was one of the queries we ran during the previous example in Spark SQL, but we can take the same query, bring it here, run it as a business user or end user, and it gets executed using the SQL engine. I can execute this and you'll see the data coming in.

So that's how it works based on persona: if I'm a data developer or data engineer, I prefer Spark for data processing in the Lakehouse; if I want to share it with a business user, or if I am a business user, I prefer the SQL endpoint to connect to and start executing my queries. This is in the Fabric interface, but there might be some tool I would be using, for example Azure Data Studio or SQL Server Management Studio, or any other tool I want to connect with to query the data. If I go to my workspace, for the Lakehouse we have here, and click on it, I can see the connection details. Once we switch to the Lakehouse view, you'll see the Lakehouse I created, and with it two other sub-items that were created by default. One is the semantic model, created by default over all the tables in the Lakehouse, and the other is the SQL analytics endpoint. If we click on this triple dot, you see an option for copying the SQL connection string. We can copy this SQL connection string; this is the name of the server that we specify when connecting from any SQL tool or reporting tool.

That makes sense. Once I have designed my ETL, combining Spark SQL and PySpark, or using the SQL analytics endpoint, and the ETL is completed, what are the ways to orchestrate it at scale?

Yes, there are multiple ways we can do that, depending on what we are trying to do and how complex we want the pipeline to be. Say, for example, I have just one notebook and I want to execute it at a certain frequency. I can go to that notebook, click on settings, and then under Schedule I can specify the schedule this notebook should run on. That's one way to start executing in a very simplified manner; that's number one. Number two: say I have multiple notebooks and I want to execute one notebook from another, basically a parent notebook that calls all the child notebooks. That is where we can use the utility that Spark in Fabric provides, called mssparkutils, and use its runMultiple function, specifying the notebooks we want to execute. In this example, as you can see, I have three notebooks to load data into the dimension
tables; these three will be kicked off and run in parallel, and that's the first statement. Once that completes, I execute the second line of code, which starts executing two other notebooks to load data into the fact tables. And finally there is another fact table that I want to run at the end of this process. So this is how I can orchestrate, or do notebook chaining, to execute one notebook from another
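The parent-notebook chaining just described might look roughly like this. Note this is a sketch: the notebook names are hypothetical, and the code only runs inside a Microsoft Fabric notebook session where mssparkutils is available.

```python
# Parent notebook: chain child notebooks with mssparkutils.
# Only runs inside a Microsoft Fabric notebook; names are illustrative.
from notebookutils import mssparkutils

# Step 1: load the three dimension tables in parallel.
mssparkutils.notebook.runMultiple(
    ["Load_Dim_Customer", "Load_Dim_Product", "Load_Dim_Date"]
)

# Step 2: once the dimensions are loaded, load two fact tables in parallel.
mssparkutils.notebook.runMultiple(["Load_Fact_Sales", "Load_Fact_Returns"])

# Step 3: a final fact load that depends on everything above.
mssparkutils.notebook.run("Load_Fact_Inventory", 600)  # timeout in seconds
```

Each runMultiple call blocks until the listed notebooks finish, which is what gives the parallel-then-sequential shape described in the demo.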

### [30:00](https://www.youtube.com/watch?v=qzZvGfx16zM&t=1800s) Segment 7 (30:00 - 34:00)

notebook. So this is the second method of orchestrating and scheduling the execution of different notebooks. The third option, which I guess you will find very convenient and will often use, is called a pipeline. Under the Data Factory workload of Microsoft Fabric, you will find an option to create a pipeline. This is one example of a pipeline I have created; in this case I have one simple Copy Data activity, and I can come here and include other activities. For example, I want to include a Notebook activity: once my data lands from the source into the staging table, I want to call this notebook, and the flow should go like this. This is one notebook, and I can choose which notebook should be executed when this activity runs. Likewise, I can bring in multiple notebook activities, chain them together, or make them run in parallel, sequentially, or a combination of both. So we have full flexibility in defining how the workload should run, and once that is done, we can schedule it, again based on what the business requires. That's the third way of orchestrating and scheduling your pipeline.

Arshad, to recap, can you comment on the performance if we compare the DW, so code written in T-SQL, versus Spark SQL or PySpark? You mentioned PySpark is equal to Spark SQL because the same optimization happens for the same queries, but what about T-SQL? Is there a way to compare, or does it differ? What's your perspective?

I think it depends on the use case you are trying to execute. It's better to test it in your scenario and see what works best for you. Both engines that we have in Fabric are highly optimized, and we keep investing in each of those areas to give you, as a user, the choice. So the performance that you see today might change over time. I would suggest you run your tests and see what you are comfortable with. If you're comfortable with SQL, you can go ahead and use the data warehouse or the SQL endpoint; if you're comfortable with Spark, or you want to take advantage of the Spark runtime and the extensibility Spark provides, you can use Spark. And that's pretty much it. Thank you for listening.

If I may, I will ask one more question, related to one element you mentioned: the SQL analytics endpoint that allows you to access the data from the Lakehouse using T-SQL. It's a perfect tool to, for example, select some number of rows and create a view on top of them. So can you comment: as of now, this endpoint allows you only to read the data; are there any plans to use the same endpoint also to write data back to the Lakehouse?

Not at this time. We use Spark for all the data manipulation in the Lakehouse. If you want the ability to manipulate data with SQL, I think the data warehouse is the place to go. We want to keep a single writing engine so that there is data consistency, and you can still access the data across different engines. For example, if you have data in the Lakehouse, you can use the SQL endpoint and take advantage of SQL to read it; likewise, if you have data that you prepared in the data warehouse, you can read that data into the Lakehouse. Having multiple engines write to the same data might corrupt it, so that's why it is restricted at this time. But yeah, if we get feedback, maybe we can think about that.

Awesome. Arshad, thanks a lot for joining and sharing great demos and a great portion of knowledge. For those who are watching us: remember to hit the like button, leave a comment, subscribe to the channel, and share the video with your colleagues who are using Fabric and may benefit from the knowledge Arshad shared. Until next time: happy using Spark SQL or T-SQL, or just exploring which one is better for your needs. Thanks a lot.

Thank you so much. Again, thank you. Bye-bye.

---
*Source: https://ekstraktznaniy.ru/video/44748*