Job Queueing for Notebook in Microsoft Fabric | Spark Compute


Segment 1 (00:00 - 05:00)

Host: Hey everyone, welcome back to our channel dedicated to Microsoft Fabric. This is a series on data engineering, data science, and everything happening with Apache Spark jobs in Microsoft Fabric. Today Santos joins me for another discussion about jobs, and about how our clusters handle huge numbers of jobs running at the same time. Today we're discussing job queueing. Santos, thanks for joining us — can you give us the context for job queueing from the very beginning? As a user, I go into Fabric and schedule my job. What happens then, and when do I run into job queueing?

Santos: Hi everyone, and thanks for having me again — I really appreciate it. Talking about job queueing: we've had queueing support for Spark in Fabric since our public preview, when it was announced at the Build conference. A data engineering or data science user can run a data engineering job or a data processing and transformation job orchestrated as a batch job, that is, a Spark job definition in Microsoft Fabric. When we initially enabled queueing, it was only for batch jobs. Based on a lot of customer feedback, we wanted to extend it to notebook jobs as well, the notebook being one of the most common items users orchestrate jobs with. With the recent announcement, queueing now applies whenever a user submits any job that is meant to be triggered and processed automatically as a background operation. Think of it this way: users might have scheduled notebook runs that ingest data into a lakehouse, with subsequent downstream runs doing conformance checks and standardization for the medallion architecture in their data. In those scenarios, they no longer have to set up triggers that constantly watch for notebooks or batch jobs hitting throttling errors and retrigger them. That's an expensive operation, because at the scale data platforms operate, there can be many thousands of pipelines, with many thousands of notebooks inside them, triggered during peak usage hours. So what happens now is that when users are running at max capacity, their notebooks or batch jobs are automatically added to the queue on the Spark scheduler, and those jobs are automatically retried. If you remember, we spoke about job admission in the last episode and saw how job admission happens. This is the next stage: the job is admitted, you have jobs running at your maximum utilization, and now you submit another job that exceeds the compute. We briefly touched on this when we said jobs could either be throttled or queued — job queueing is the deep dive into the queueing scenarios.

Host: That makes sense. Can you elaborate on the concept of job queueing — what exactly does this feature change?

Santos: The main new addition we're enabling as part of this feature is extending queueing support to notebooks triggered by pipelines, REST APIs, or the job scheduler. So if you scheduled your notebook through the job scheduler — based on a refresh frequency, say daily, hourly, or even down to a minute — or if you have pipelines with notebook steps, or notebook run APIs orchestrated through your processes, in those cases the jobs will automatically be added to the queue and processed automatically. In terms of activation, and how to know when it's enabled and working: as I mentioned, it was originally enabled for Spark job definitions, i.e. batch jobs, and with this new addition we're extending it to notebook jobs, the notebook being a popular item for data engineering and data science activities.
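As one concrete illustration of the "notebook run APIs" path Santos mentions, here is a minimal sketch of submitting a notebook as a background job through the Fabric Job Scheduler REST API. The endpoint shape and the `jobType=RunNotebook` value reflect the public Fabric REST API as we understand it, but the IDs and token handling below are placeholders — treat this as an assumption to verify against the official docs, not as the exact call used in the episode.

```python
import requests

# Hypothetical placeholders -- substitute your own workspace/notebook IDs
# and a valid Microsoft Entra bearer token (e.g. via azure.identity).
WORKSPACE_ID = "<workspace-guid>"
NOTEBOOK_ID = "<notebook-item-guid>"
TOKEN = "<bearer-token>"

# Submit the notebook as a background job via the Fabric Job Scheduler API.
# Runs submitted this way are eligible for queueing once the capacity is
# at its maximum Spark vCore utilization.
url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{NOTEBOOK_ID}/jobs/instances?jobType=RunNotebook"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

# A 202 Accepted response includes a Location header pointing at the job
# instance, which can be polled for states such as Queued or InProgress.
print(resp.status_code, resp.headers.get("Location"))
```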

Segment 2 (05:00 - 10:00)

Santos (continuing): So if you're a data engineer or data scientist and you've scheduled your notebook for a daily or hourly refresh, or you have REST APIs or pipeline activities with notebook steps, job queueing will automatically kick in and process your jobs when cores become available for your Spark compute.

Host: Awesome. So under what circumstances does job queueing typically become active for me?

Santos: Job queueing is enabled by default for all workspaces and all capacities, so if you're using data engineering or data science items — anything using Spark in Fabric — you can use job queueing. It kicks in when you're at your maximum utilization state. For example, on an F64 — and we'll dive into more of the details when we talk about throttling scenarios — if you're already using, say, 300 Spark vCores and the next job requires a minimum of 100 Spark vCores to start, it won't be able to start. Instead of getting a job rejection error — in which case, as a user, I'd have to keep monitoring for active jobs to complete and then resubmit — the job simply goes into a queued state. You'll see the job get picked up automatically and start executing as cores become available. It's completely automated.

Host: I just want to recap: this feature was already enabled for Spark job definitions in the past?

Santos: That's correct, yes.

Host: And now it also works for notebooks. The experience without job queueing was that, as a user, once we reached the limits of my capacity — take the public F64 example again, there's a predefined limit for that capacity — if my whole team scheduled too many jobs, my job was rejected and I had to monitor, check, and retry. With job queueing, I just schedule the job, and once we're no longer hitting the limit and resources are available, the execution happens. So basically we remove the overhead of monitoring and running retries — you handle it, as a platform, as a SaaS tool.

Santos: 100% right. It would be a huge operational overhead for customers, especially when operating at scale: with notebook jobs triggered automatically on a schedule or from pipelines, it's hard for them to monitor and manage when those jobs are entering a queue, when they're getting throttled because of maximum utilization, and when they have to wait and retry. Here it's all automated. Each Fabric SKU has a defined queue limit — I'll share the blog link that covers the SKU-level queue limits — and once jobs exceed the queue limit, they get rejected. The queue limits are defined per SKU to make sure jobs aren't left sitting in the queue for more than a day or so. You don't want that either, because it could lead to longer job starvation: if one rogue job has been running for multiple days, you don't want all of your jobs queued behind it for multiple days — that would add latency across your entire data processing flow. The queue expiry is currently set to 24 hours from the time the job is admitted into the queue.

Host: Before we jump to throttling, can you share and confirm that job queueing works for any Fabric and Power BI capacity? There's no limitation that only capacities above some specific size get this feature — it works for everyone?

Santos: That's a good one. It does work for all Fabric capacities. One thing I'd call out is the trial capacity: trial capacities are meant for proofs of concept and user-level prototyping, so they're not enabled with queueing — you won't get the automatic queue processing on a Fabric trial. But all the other Fabric capacities, from F2 all the way to F2048, have queueing enabled by default.
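To make the run / queue / reject / throttle outcomes described in this episode easier to follow, here is a purely conceptual model in Python — not a Fabric API. The vCore numbers match the F64 example above; the queue limit value is made up (the real per-SKU limits are in the blog Santos references), the 24-hour expiry is noted but not simulated, and the throttling of interactive jobs comes up later in the episode.

```python
from collections import deque

# Conceptual model of queue admission only -- not a Fabric API.
MAX_VCORES = 384          # F64 with 3x burst: 64 CU * 2 vCores * 3
QUEUE_LIMIT = 250         # hypothetical; real per-SKU limits are in the blog
QUEUE_EXPIRY_HOURS = 24   # queued jobs expire 24h after admission

running_vcores = 380      # capacity is nearly maxed out
queue: deque = deque()

def submit(job_name: str, min_vcores: int, interactive: bool) -> str:
    """Decide what happens to a newly submitted job."""
    global running_vcores
    if running_vcores + min_vcores <= MAX_VCORES:
        running_vcores += min_vcores
        return f"{job_name}: started"
    if interactive:
        # Interactive runs are throttled immediately rather than queued.
        return f"{job_name}: throttled (max Spark compute limit reached)"
    if len(queue) >= QUEUE_LIMIT:
        return f"{job_name}: rejected (SKU queue limit exceeded)"
    queue.append((job_name, min_vcores))
    return f"{job_name}: queued (retried automatically as cores free up)"

print(submit("nightly-ingest", min_vcores=100, interactive=False))  # queued
print(submit("adhoc-query", min_vcores=16, interactive=True))       # throttled
```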

Segment 3 (10:00 - 15:00)

Santos (continuing): One more thing I want to call out specifically — something we also touched on in the previous episode on optimistic job admission — is the overall capacity throttling state. Say you've exceeded your burst limits and your capacity is in an unusable state. At that point, submitting jobs and expecting them to be queued may not work, because your jobs won't even be accepted in the first place — you won't even be able to open items like notebooks. In those conditions, your jobs will not be queued. You want to make sure your capacity is back to a functional, active state where you can query and view your items; that's when subsequent job submissions go into the queue and queue processing is taken care of.

Host: And from the capacity admin perspective, what would you recommend for monitoring the capacity? I think we have decent monitoring for capacities, and thanks to it we can see whether our capacity is functioning well or running into queue limits and other problems.

Santos: That's an excellent question. I'd recommend all capacity admins monitor the utilization view in the Capacity Metrics app. That gives them the distribution across all jobs, the per-job CU usage, and the overall trend in their utilization rate. If they see they're close to the maximum utilization line, that's when they want to make sure their jobs are being managed correctly, and that there's no scenario where some jobs are over-utilizing the capacity, which could end up throttling production workloads. One tip I'd also love to share — and other customers have been using this — is to use Data Activator on top of the Capacity Metrics app to set up alerting based on the utilization rate. You get email alerts when the capacity nears a certain threshold, so you can adjust, or at least contact the teams that are overusing the capacity.

Host: Yeah, makes sense — good tips. Now let's jump into throttling. What's the concept of throttling in the context of Apache Spark, and how do I know if I've been throttled?

Santos: Job queueing and throttling come together. Some background context: the primary way to get started on Fabric is to acquire a Fabric capacity, or a Fabric trial, from Azure. Once you have the capacity, you size it based on your workload requirements. As you can see, there are different capacity SKUs, and for every SKU you get an equivalent number of capacity units; one CU translates to two Spark vCores. The data engineering and data science workloads in Fabric are powered by Spark, and the Spark compute runs on the Fabric capacity just as all the other workloads do. For larger big-data workloads, or data engineering and data science teams with strict SLAs who are migrating their existing data infrastructure to Fabric, there are strict core requirements — some teams want at least 1,000 Spark vCores in a day. For those scenarios I'm sharing this mapping table, so they can map their requirements to the capacity SKU they should start with for benchmarking and testing their workloads, and use Fabric at scale. For each capacity you get a number of Spark vCores: starting from the F2, it translates to four Spark vCores, because one CU gives two Spark vCores, and then a burst factor is applied. Bursting is something we talked about during the optimistic job admission episode; the reason we have bursting is that in Fabric we want users to do more with what they have and get the best compute utilization. Think of bursting as borrowing from the future. For Fabric Spark, the burst factor applied across all capacity SKUs is 3x. The F2 is a special case: based on a lot of customer feedback we wanted to support starter pools, so with a minimum node size of eight Spark vCores for starter pools, the F2 has a maximum Spark vCore limit of 20. All the other Fabric SKUs get the full 3x burst.
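The CU-to-vCore arithmetic above fits in a few lines. A minimal sketch, assuming exactly the rules stated in the episode — one CU equals two Spark vCores, a 3x burst factor on every SKU, and the F2 special case capped at 20 vCores for 8-vCore starter pools:

```python
# Sketch of the CU -> Spark vCore mapping described above.
SKUS = {"F2": 2, "F4": 4, "F8": 8, "F16": 16, "F32": 32, "F64": 64,
        "F128": 128, "F256": 256, "F512": 512, "F1024": 1024, "F2048": 2048}

def spark_vcores(sku: str) -> tuple[int, int]:
    """Return (base vCores, max vCores with burst) for a capacity SKU."""
    cu = SKUS[sku]
    base = cu * 2                             # one CU = two Spark vCores
    burst = 20 if sku == "F2" else base * 3   # F2 is the special case
    return base, burst

for sku in ("F2", "F64", "F1024"):
    base, burst = spark_vcores(sku)
    print(f"{sku}: {base} base vCores, up to {burst} with burst")
# F2: 4 base vCores, up to 20 with burst
# F64: 128 base vCores, up to 384 with burst
# F1024: 2048 base vCores, up to 6144 with burst
```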

Segment 4 (15:00 - 20:00)

Santos (continuing): Take the F64, a common SKU for data engineering workloads: 64 × 2 gives you 128 Spark vCores, and with the 3x burst applied on top you get 384 Spark vCores. Bursting, as I mentioned, is borrowing from the future. The 3x limit means that at any point in time you're allowed to use up to 384 Spark vCores within your capacity; if you then have pockets of inactivity during the later half of the day after running the job, the excess usage from that short burst is smoothed over those idle periods, so your overall utilization stays below the max limits. Once you've submitted jobs that max out the Spark vCore limit, that's when you experience queueing, which we saw a while back. In the case of interactive jobs — say a user querying from a notebook, or performing operations on a lakehouse table like a Load to Delta or a Convert to Delta — those interactive operations are throttled with a throttling error stating that you've reached the max Spark compute limit. You'd then either go to the Monitoring Hub to free up cores by checking for active jobs that are consuming them, or bump up to a larger capacity so you have more cores available to use concurrently.

Host: Thanks for sharing. Now I have a few questions, and I also want to do an exercise simulation. As I mentioned last time, we had FabCon, a conference focused just on Microsoft Fabric, in Vegas at the end of March this year, and during the fall we're going to Sweden for another conference in Europe — we invite you to join. There we're also going to run a workshop similar to the one in Vegas, on building your first end-to-end lakehouse solution. That means one room with 200 enthusiasts who all want to build a lakehouse solution, all inside one tenant, so we can simulate one organization. For a smoother experience, we split that group of people across a few capacities. For every capacity we have to know the size — that's the first requirement: as a Microsoft Fabric user you have to know whether you have a P capacity (a Power BI SKU) or a Fabric capacity, and what size you're paying for; in Fabric we pay for and buy capacities. Once we have a capacity and know its SKU, we can calculate the number of vCores that capacity can handle. Based on that, let's assume we split 200 people into four groups — cohorts, as we called them — of 50 people each. Every person who starts playing with Spark jobs — through notebooks, through Load to Table, through pipelines, through different mechanisms — spins up one job, but that one job can use multiple nodes. So Santos, can you give guidance on how to avoid, or at least handle, throttling in those scenarios? Then I'll share how we did it.

Santos: An excellent scenario, and actually a pretty common one in an enterprise data team: multiple teams consuming, or rather sharing, a single capacity. The place to start, as you mentioned earlier, is to understand the max core requirement at any point in time — at peak usage, what is the maximum number of cores you could use? Are there specific jobs that are outliers, with super-high core usage compared to your overall average throughout the day? That's where job-level bursting comes into play. Say your average usage is around 128 Spark vCores, but there's one job you run every four or eight hours that requires 300 Spark vCores.
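A small numeric sketch of the smoothing idea Santos describes above — excess burst usage averaged across later idle periods. This is an illustration of the concept only, not Fabric's actual smoothing or billing algorithm, and the hourly numbers are hypothetical:

```python
# Numeric illustration of smoothing -- not Fabric's actual algorithm.
# A 3-hour burst at the full 384-vCore limit on an F64, followed by
# nine idle hours, averages out below the 128-vCore base allocation.
BASE_VCORES = 128                          # F64 base: 64 CU * 2 vCores

hourly_usage = [384, 384, 384] + [0] * 9   # hypothetical 12-hour window

smoothed = sum(hourly_usage) / len(hourly_usage)
print(f"peak: {max(hourly_usage)} vCores (3x burst)")
print(f"smoothed over {len(hourly_usage)}h: {smoothed:.0f} vCores -> "
      f"{'within' if smoothed <= BASE_VCORES else 'over'} the base limit")
# smoothed over 12h: 96 vCores -> within the base limit
```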

Segment 5 (20:00 - 25:00)

Santos (continuing): You don't have to go for a capacity SKU that would fit 300 Spark vCores, like an F256 — you can still use an F64 and leverage job-level bursting. For that particular job alone, you can set up a pool whose max node limit goes up to the 384 Spark vCores — in this case up to 48 executors, configured with autoscale — and dedicate that pool to the job. If you're a capacity admin who wants an efficient mechanism for the best possible compute utilization, this job is then isolated: it uses the maximum burst and runs without impacting your overall capacity, and you don't need a dedicated capacity standing by just to support this compute-heavy job. For other scenarios: if your average usage is around 100 or 128 Spark vCores, you can just use an F64. If, as in your lab, you can have 300-plus users constantly querying and you want the best experience — which is very common for centralized data teams serving multiple sub-teams for reporting or data science — then it's better to go for a capacity like an F2048 or F1024, where you get at least around 6,000 Spark vCores. If you want at least 100 Spark vCores per user, you can support a team of up to 50 concurrently while still staying within your utilization range. You still have capacity available for other Fabric workloads, like the data warehouse and Power BI, and you have zero disruption — no queueing or throttling that could hinder your data science or lakehouse querying experience.

Host: Makes sense. So here's what we did: we split the users across capacities, each at minimum a P1 or P2. But here I want to make a point: there are differences between Power BI SKUs and Fabric SKUs — some features are exclusive to Fabric SKUs — so for those watching, please remember to check the details on that. Then we calculated exactly what participants were going to do, recalling that some operations, like Load to Delta, keep a session and a job running for a few minutes after the operation completes. So we used the Monitoring Hub to kill some jobs once the operation was done. Essentially, during the workshop we teach participants how to manage and optimize the resources working for us — at the end of the day, at a company, that's how we minimize cost. And that was the solution: we were shuffling different Fabric capacities at the same time, and there's always the option to scale up and scale down.

Santos: That's the most important thing too — as you rightly called out, setting up the compute is the first and most important step. You want your compute definitions to be based on your workload requirements. If you're going to run a lot of ad hoc, small-dataset transformation activities, single-node support is available in Microsoft Fabric, so you can start with a single node. That maximizes your concurrent users — beyond multiple hundreds if you go for larger capacities like an F256. So sizing the compute based on average core usage is super critical.

Host: Exactly. That was the first exercise: log in to Microsoft Fabric, then go to workspace settings. Because we knew the dataset we were going to process, we explicitly said we don't need autoscaling from two to eight nodes — we just need a single node. In our case we gave every participant two nodes, which was more than enough to complete all the exercises. But it shows the real case, because we get tons of questions about how to calculate the total overall cost for Fabric: you have to start by counting and estimating from the jobs, then count the usage in Spark vCores — and with two Spark vCores per Fabric capacity unit, you get your final number.
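A back-of-the-envelope version of the sizing and cost estimation the host walks through — from jobs, to Spark vCores, to capacity units. All inputs are hypothetical workshop-style numbers; the only conversions taken from the episode are two Spark vCores per CU and the 3x burst factor:

```python
# Back-of-the-envelope sizing: jobs -> Spark vCores -> capacity units.
participants = 50        # one cohort of the workshop
nodes_per_user = 2       # single-node-style pools, no autoscale
vcores_per_node = 8      # assumed node size (e.g. a small/medium node)

peak_vcores = participants * nodes_per_user * vcores_per_node
peak_cu = peak_vcores / 2            # two Spark vCores per capacity unit

print(f"peak demand: {peak_vcores} Spark vCores = {peak_cu:.0f} CU")
# peak demand: 800 Spark vCores = 400 CU
# With the 3x burst factor, an F256 (512 base vCores, 1536 with burst)
# absorbs this peak; sustained usage at this level would need the
# smoothing headroom or a larger SKU.
```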

Segment 6 (25:00 - 27:00)

Host: And I really love that bursting feature — "borrowing from the future" is a great conclusion and a great name for that functionality. Santos, thanks a lot for joining; it was a very pleasant discussion. I'm looking forward to the continuation of this series — we've covered optimistic job admission and now job queueing. What's the next topic you want to discuss?

Santos: I think you've already given a good introduction for the next one. Now that we've seen all of this, it would be good to have a refresher on Spark compute at its different levels. We also have some new features in capacity settings for compute resource management, so we'd walk users through the different hierarchies and layers: as a capacity admin, how you can do compute governance by creating capacity pools; then workspace pools — walking through, as we mentioned, setting up the pools, which is where it all starts. That's a super critical step, because it powers all of your data engineering and data science work, and getting it right lets you scale in a much better way. Then, going one level deeper: it's always capacity, then workspace; within a workspace you have environments; within an environment you can have multiple notebooks; and then there's session-level personalization — compute personalization can also be done at the session level. I think it would be really useful to take users through all four of those layers and share some of the new updates as a next step. That should complete the overall Spark compute topics in terms of compute management.

Host: That's awesome — I'm so excited, and I can't wait to meet and talk to you again. Thank you so much. For those watching, remember to hit the like button, leave a comment or a question, and until next time: don't worry, we have job queueing, so just happy spinning of Apache Spark jobs. Thanks a lot!
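As a teaser for the session-level personalization Santos mentions, Fabric notebooks support a session configuration magic at the top of a notebook cell. The snippet below is a hedged sketch: the `%%configure` magic exists in Fabric (inherited from the Synapse/Livy lineage), but the exact property names and values here are illustrative — check the current Fabric documentation before relying on them.

```
%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 2
}
```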
