# High Concurrency Mode for Notebooks in Pipelines for Fabric Spark

## Metadata

- **Channel:** Azure Synapse Analytics
- **YouTube:** https://www.youtube.com/watch?v=AF6HUl4wImM
- **Source:** https://ekstraktznaniy.ru/video/44740

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

Hey everyone, welcome back. This is a new episode of Fabric Espresso, the series about data engineering and data science. Today I have the pleasure of hosting Santos, who is joining us directly from the Fabric Data Engineering product group. Santos, thanks for joining us.

Hey Estera, hey everyone. Thank you so much for having me again. Super excited to be back on Fabric Espresso.

That's awesome. So can you recap the recent features, from the platform perspective and from the Spark compute perspective, that you and your team specifically shipped?

It's going to be a lot of features. The monthly blogs cover these in detail, but some of the key highlights include the latest runtime, Fabric Runtime 1.3, which is generally available. The native execution engine has been the most exciting thing: the acceleration it provides for queries, plus the additional control for customers to enable or disable it at the query level, gives them flexibility. We also have some cost optimization features, such as high concurrency for pipelines, so customers can share sessions across different notebooks when they have scheduled runs within their workspace. There are also other platform features that we announced for general availability a couple of months back for network security: tenant-level private links, and secure connections from Fabric workspaces to data sources through managed private endpoints. So a lot of features across the stack. I know it's been a while, but I'm super glad to be here and share these updates with you all.

Yes, and all the updates are described in our monthly blog post. Every month we publish something that is almost a book, because it runs to about 100 pages, and this is just a sneak peek of the changes. You mentioned the topic for today's episode, which is high concurrency mode in Fabric Data Engineering. So I would like to kick off with the context: what is high concurrency, why do we need it, and what is changing? We already recorded an episode about high concurrency mode for notebooks, so what's new?

Yes, we did, and that was a very interesting discussion. I also heard a lot of feedback from customers after they watched the video. High concurrency mode is about session sharing: running Spark sessions in a shared mode without compromising on security, because sharing stays within the same user boundary, and allowing users to get the maximum resource utilization. It also helps them balance performance against utilization and run more with what they are paying for. The main feedback I got, or I would say more of an ask from different viewers of your channel, was: when can we get this for scheduled jobs? Interactive queries are how users initially start building data products or standardized entities in any data engineering platform, but the actual concurrency kicks in when they process massive loads and transform that data at scale through pipelines for their production scenarios. We have heard a lot of feedback and asks from different enterprise customers, so I'm super happy to walk you all through it in this episode.

I know it's been a while since we talked about high concurrency, so there are three main points that I want to land. The first thing I would emphasize is security. The session sharing boundary is always within a single user's context. There is no way that multiple users could land on the same cluster, which avoids all the threat vectors where they could exfiltrate tokens or data access paths. The next one is how this allows users to multitask, or concurrently use the same session, with a faster session start. Say customers are going with larger pools and have custom JARs or different libraries: subsequent sessions reuse these if they match the criteria, and they get a session start of about 5 seconds, giving them the perf boost of eliminating the session start

### Segment 2 (05:00 - 10:00) [5:00]

duration. The next one is, of course, that it's cost effective, because you're only running a single Spark application in which you can run something like five jobs at once, which gives you savings on COGS.

That said, just a recap of high concurrency for notebooks, which we talked about on Fabric Espresso a couple of months back. This is generally available. Users can go into the notebook and select an existing high concurrency session if they already have one running; if not, they can start a new one. Once they have started a new one, the subsequent sessions added to it start within about 5 seconds, which gives them the advantage of a faster session start. Once these are running, they can see the runs through the Monitoring hub experience. In the Monitoring hub, all of these notebooks are packed into the same session, but in the related tab, as a column option, users can see each job execution. Every statement executed from a notebook is attributed to the notebook from which it was run within the Spark application. The faster session start matters only for scenarios where customers use custom pools. With starter pools they get the 5-second session start anyway, so it applies where they use large, extra-large, or small pools for their EDA (exploratory data analysis) scenarios.

Moving on to high concurrency for notebooks in pipelines. In pipelines, users leverage notebook activities for their scheduled runs, and this is one of the most used approaches among enterprise customers for orchestrating and managing production workloads: ingesting data from different sources and transforming it in their lakehouse. Enabling this experience is super simple. The high concurrency setting is available as part of the workspace settings; I'll show you in a bit how you can try it. When it's enabled, every Spark session is capable of hosting subsequent notebooks. In interactive notebook sessions, users have to opt in and select which session to attach to. That is not the case for background jobs or scheduled runs, because no one is going to be watching these. The system automatically packs them based on the session matching criteria. The criteria are the same as for interactive notebooks: the notebooks should have the same default lakehouse (the file system dependency), be within the same user boundary and the same workspace boundary, and have the same compute configurations and library management dependencies.

I've also heard from other users that they want a little more control over which notebooks are grouped together. They don't want all notebooks to be shared in a common manner; certain notebooks could have different SLA or performance requirements. So we also have the session tag feature. You can see in the GIF that it's available as part of the advanced settings options within the notebook activity. Once you specify a session tag, the session tag also becomes a matching criterion, so a session with a given session tag can host other notebooks that come in with the same session tag. It gives users additional control to pack and group these notebooks.

With that said, let me quickly show how users can enable this experience. I have a Fabric workspace, and I go to the workspace settings. As the first step, I enable high concurrency mode: from the Spark settings, I navigate to the High concurrency tab and enable high concurrency mode for pipelines. After I save the Spark settings, I navigate to a pipeline that I have created here, which is going to be used for this test. In this pipeline I have a ForEach activity with a set of notebooks that are going to be executed concurrently. In the settings of these notebooks, under the Advanced settings tab, there's an option called session tag. I'm going to specify a string or a GUID that I want to use to pack these sessions together. Let me name this "high concurrency session". Once I save it, I do the same for all the notebooks. Now that I have saved, I'm going to trigger this pipeline. We can see that the runs are now being triggered. Now you can see all these

### Segment 3 (10:00 - 15:00) [10:00]

applications that are running with the same session ID, which means that they're actually shared, and you can see that Notebook 3, Notebook 2, and Notebook 1 are executing concurrently. When I click on this session, you can see it has the same session ID, and when I navigate to the Notebook tab I can also see the different notebook statements that are being triggered and processed.

Santos, thanks a lot for doing the demo. Now the best part: a few questions. For sure you convinced me to enable it because of costs: instead of paying for multiple sessions, I'm paying for one that I can reuse. So, cost optimization. But now convince me it's secure.

Yes, so let's talk more about the security part. As I mentioned, the session sharing boundary is still within the same user context. Say you and I are in the same workspace, and you start a Spark session, either as a high concurrency session or by running notebooks. It's going to use your identity. If I also trigger the same pipeline in the same workspace, keeping everything else the same, your Spark sessions are still dedicated to you. There's no way for me to run my code, or run the same pipeline that you're using, to inject myself into your Spark session. It is always dedicated, so the secrets or tokens that you manage within your Spark session are still protected and are not exposed to other users who are using this feature or have their scheduled runs running in shared mode.

Makes sense. Now, this is about the user, right? As a user I can share and reuse the same session. We shipped that for notebooks, and now for data pipelines. Let us assume I'm a user who works in multiple places, so I have multiple workspaces; this is the typical job of system integrators. Is that functionality bounded to the workspace as well? So is it scoped to the user within the workspace, or to the user across multiple workspaces? If you can clarify that.

The control is at the workspace level. So in this case I would recommend that the user enable this setting in all workspaces. In cases where they want dedicated sessions, I would recommend that users use a session tag to get dedicated runs. In other scenarios they still get the maximum utilization through session sharing, which is done by default. So you still get to optimize on cost and also trade off price and performance based on your priority. For certain jobs that you want to be dedicated, put them on a dedicated session tag, so they run in isolated mode and do not share resources with any other notebooks.

Makes sense. And just to confirm, it also works for the data science part, so data engineering and data science?

Yes.

Awesome. And there is a new functionality that we announced recently, Python notebooks: a notebook without the Spark runtime, just Python, a clear, minimalistic setup. For that functionality, because there is no Spark session, we do not have to enable anything, because this high concurrency feature works for Apache Spark sessions?

That is correct, yes. It is for Spark sessions. And since this works at the session context level, there are scenarios where customers could have different code in different languages within the cells. If it's still running within the Spark context, then the sharing applies.

Is that language agnostic?

Yes, it is.

Are there any implications related to security features like a managed VNet workspace?

No. Actually, in this case I would say it gives you a benefit. With managed VNets, another piece of customer feedback that we have been hearing is about slow session starts, because they don't get starter pools, which are in a shared tenant model. Happy to discuss this in a more dedicated call on network security. But in this case there's more of a workaround: users can create Spark sessions and have the subsequent sessions start faster, so they can mimic the starter pool experience in managed VNet enabled workspaces through high concurrency mode. So that's a secure way of still getting this, but the first session is going to pay the price of the initial session start of around three minutes.

And regarding the pricing, are there any additional costs associated with enabling it?

Not at all. It's only cost benefits that they would get.

So why not? Again, what's the reason not to enable it?
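The session matching rules described in the walkthrough (same user, same workspace, same default lakehouse, same compute configuration, same libraries, and the optional session tag) can be illustrated with a small model. This is only a conceptual sketch in plain Python, not Fabric's implementation or API: the `NotebookRun` class, the `session_key` and `pack_runs` helpers, and all field values are hypothetical names invented for the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NotebookRun:
    """Hypothetical description of one notebook activity run."""
    name: str
    user: str
    workspace: str
    default_lakehouse: str
    compute_config: str
    libraries: frozenset = frozenset()
    session_tag: str = ""  # empty string means no tag was set


def session_key(run: NotebookRun) -> tuple:
    """Runs share a session only if every matching criterion is identical."""
    return (
        run.user,
        run.workspace,
        run.default_lakehouse,
        run.compute_config,
        run.libraries,
        run.session_tag,
    )


def pack_runs(runs):
    """Group run names by matching key; each group models one shared session."""
    sessions = defaultdict(list)
    for run in runs:
        sessions[session_key(run)].append(run.name)
    return list(sessions.values())


# Assumed example values, chosen only to mirror the demo in the episode.
common = dict(
    workspace="demo-workspace",
    default_lakehouse="demo_lakehouse",
    compute_config="custom-medium-pool",
)
example_runs = [
    NotebookRun("Notebook 1", "estera", session_tag="high concurrency session", **common),
    NotebookRun("Notebook 2", "estera", session_tag="high concurrency session", **common),
    NotebookRun("Notebook 3", "estera", session_tag="high concurrency session", **common),
    NotebookRun("Notebook 4", "estera", session_tag="dedicated-sla", **common),
    NotebookRun("Notebook 5", "santos", session_tag="high concurrency session", **common),
]

for group in pack_runs(example_runs):
    print(group)
```

In this model the three notebooks with the same tag and user land in one group, the notebook with a different tag gets its own group, and the run triggered by a different user is never packed with someone else's session, which mirrors the user-boundary point made in the security discussion.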

### Segment 4 (15:00 - 16:00) [15:00]

So we encourage all of you to test it and enable it, because it simply saves you time and cost. Santos, thanks a lot for joining and sharing. I can't wait to record a session about network security.

Yes.

Again, security is our foundation and the priority we are focusing on, and thanks to it we are building trust with our biggest customers around the world. So, for all who are watching us: thanks for coming, thanks for watching, and remember to leave a comment, suggestion, or question, and hit the like button. Until next time, just enable high concurrency mode and pay less. Thanks a lot.

Thank you so much. Thanks, Estera. Thanks again for having me, I really appreciate it.