Apache Spark on Kubernetes: The RIGHT Way (No Master/Worker Clusters Needed)
57:54

CodeWithYu · 27.01.2026 · 2,812 views · 91 likes

Video description
Run Apache Spark jobs on Kubernetes with ZERO permanent infrastructure! In this comprehensive tutorial, you'll learn how to deploy a production-ready Spark setup that creates pods ONLY when jobs run and automatically cleans up when done. Say goodbye to costly always-on Spark clusters.

🎯 WHAT YOU'LL LEARN
✅ Deploy Spark on Kubernetes WITHOUT permanent master/worker nodes
✅ Build custom Spark images with embedded PySpark jobs
✅ Submit jobs that auto-scale executors and self-cleanup
✅ Run real-world analytics: customer segmentation, cohort analysis, revenue trends
✅ Set up the Spark History Server for job monitoring
✅ Implement proper RBAC security for production
✅ Debug and monitor jobs using kubectl and Spark UI

TIMESTAMPS
0:00 Introduction
1:37 System Architecture
5:48 Setting up K8S
8:10 Setting up the project
10:00 K8S Namespaces
11:45 K8s Service Accounts, RBAC
17:27 Creating Spark Jobs for K8S
26:40 k8s Spark History Server
34:24 Spark Control Dashboard
42:42 k8s API layer
49:52 Spark Dashboard, Job submissions and review
56:52 Outro

🔗 RESOURCES & LINKS
FULL SOURCE CODE - https://buymeacoffee.com/yusuf.ganiyu/source-code-spark-k8s
• Apache Spark K8s Docs: https://spark.apache.org/docs/latest/running-on-kubernetes.html
• Kubernetes Documentation: https://kubernetes.io/docs/
• PySpark API Reference: https://spark.apache.org/docs/latest/api/python/

Like this video? Support us: https://www.youtube.com/@CodeWithYu/join

#ApacheSpark #Kubernetes #DataEngineering #BigData #PySpark #DevOps #CloudNative #K8s #DataPipelines #ETL #Tutorial

Table of contents (12 segments)

Introduction

Most people who have worked with Apache Spark long enough know that, aside from the data processing itself, the main challenge is the compute: where your Apache Spark compute is hosted. If it is not on the cloud, then you have to set something up yourself, maybe on an EC2 instance or on a Kubernetes cluster such as AKS on Azure, or any other platform where you host your own compute on Kubernetes. In our case, we are taking on the challenge of running Spark on the Kubernetes cluster that ships with OrbStack or Docker Desktop on your local laptop, and the same setup can be applied and scaled up to production on the cloud as well. As simple as it sounds, we're going to have something similar to what you see on the screen. We have a dev laptop, which is where you build the images you are going to submit, and you push them to your container registry if you are on the cloud, or keep them on your local system, which acts as your local registry. Once the image is built, we submit the job to the Kubernetes API server, which is the interface to the Kubernetes cluster that we're going to set up and discuss a little further down the line. The cluster hands the job to the driver, and the driver spins up multiple executors to execute the job you submitted. To get started with the architecture of the system we are building today, we are going to set up our Kubernetes cluster on either OrbStack or Docker Desktop. This is what the

System Architecture

architecture will look like. So let's start. We begin with our laptop, which belongs to the developer taking up this project. This is our dev laptop, and it is where we build the images with Docker and also where we submit jobs from: building images and submitting. That is the first part of what we're going to do.

We do this against the Kubernetes API server, which is where we send all our requests. If you build an image and want to submit a job, you submit it to the Kubernetes API server. From there, the work fans out into multiple executors. We're starting with just two executors in this project, and you can increase this to as many as you need, but there is one primary driver. Our driver pod plays the role of the master, and it is where the jobs go. When you submit a job, it goes to the API server, the API server triggers a driver pod to handle the submission, and once the driver picks the job up it determines how many executors to spin up for it. So we have our driver pod and our executor pods, anywhere from two to n, as many as your system memory can take. The tasks go to the executors, and once an executor finishes it sends its results back to the driver pod. The driver pod holds our Python job, our Spark context, and the job scheduler.

So this is a high-level, simplified form of the architecture of the system we'll be building today. I'll number the steps on the diagram: one, two, three, and four. We build the image and submit the job to our Kubernetes API server, the driver picks it up and sends tasks to as many executors as needed, and the executors execute the tasks and send the results back to the driver pod. That is the simplified architecture of the whole system.

All right then. The first thing you want to do is set up your Kubernetes cluster. I'm currently running OrbStack, and if you're using Docker Desktop you will need a slightly different approach. Let me know if you want me to go through that. In fact, let's go through it now. With OrbStack, you just go into the Kubernetes Pods or Services view, click on either of them, and turn on the toggle. If you turn it off, Kubernetes is disabled; when it is on, your Kubernetes cluster, pods, and services are up and running. If you are on Docker Desktop, you just want to go to Docker Desktop in this case.
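
The flow described in this segment (the laptop submits to the Kubernetes API server, which starts a driver pod that in turn requests executors) corresponds to a `spark-submit` call in cluster mode. Here is a minimal sketch of how those arguments fit together. The helper name `build_spark_submit`, the API server URL, the image name, and the job path are all placeholders assumed for illustration and are not taken from the video; the configuration keys are the standard Spark-on-Kubernetes ones.

```python
import shlex


def build_spark_submit(api_server, image, job_path,
                       namespace="spark", service_account="spark",
                       executors=2):
    """Return the argument list for a cluster-mode spark-submit on Kubernetes.

    Hypothetical helper: the name and the default values are assumptions.
    """
    conf = {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
        "spark.executor.instances": str(executors),
    }
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # requests go to the Kubernetes API server
        "--deploy-mode", "cluster",          # the driver itself runs in a pod
        "--name", "simple-counter",
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # local:// means the file is already inside the container image
    args.append(f"local://{job_path}")
    return args


command = build_spark_submit(
    api_server="https://127.0.0.1:6443",          # placeholder; check `kubectl cluster-info`
    image="spark-k8s:latest",                     # placeholder image name
    job_path="/opt/spark/jobs/simple_counter.py",
)
print(shlex.join(command))
```

Building the command as a list keeps each `--conf` pair explicit and makes it easy to pass to `subprocess.run` later without shell quoting issues.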

Setting up K8S

Most likely, Windows users and some Linux users will be using Docker Desktop anyway. You just need to go into Settings, click on Kubernetes, and make sure Kubernetes is enabled. You can use kubeadm or kind as the cluster provisioning method, whichever you prefer. I'll be using kubeadm in this case, which is the default. Click Apply and install, and it will start the Kubernetes cluster in Docker Desktop. If you did not have it before, it gets installed and everything starts running. This is the simplest way to enable Kubernetes on your local system. If you're using the cloud, it's a little trickier because you have to set your context, the namespaces, and everything else to point at the Kubernetes cluster on the cloud. Let me know if you want me to go through that in a separate video.

All right. Kubernetes is now running on Docker Desktop. I have a couple of things in there already, so I won't be using Docker Desktop, simply because I have some other stuff running on my Kubernetes for Docker Desktop. I'll be using OrbStack instead. I just wanted to show you what it looks like if you decide to use Docker Desktop. In OrbStack it looks like this: we don't have any namespace and we don't have anything on our local system yet. So let's get started and see what it looks like to set up Spark on Kubernetes.

The first thing you want to do is set up your project. You create a new project, and this is what it will look like. You can call it Kubernetes Spark or Spark Kubernetes. Maybe Spark Kubernetes sounds better. It doesn't matter what you call it; it's the content that really matters. I won't be using Python 3.13. I'll be using 3.10. So you can decide to use any

Setting up the project

other thing, but I'll be using 3.10 in this case. You can decide to create a Git repository. Oh, in case you're wondering, I'm using PyCharm on my local system. So I'll just click Create, and this bootstraps my PyCharm project, my Python environment, and all that. Open the terminal in here in case you want to use it, and increase the font size a little for readability. 25 seems okay. Good. Now that your Spark Kubernetes project is up and running, the next thing is to set up your environment the way you want it to look. I have some templates that I'll be following here. The first thing is to create a Dockerfile: touch Dockerfile. So in our local directory we now have a Dockerfile. I'll delete the generated main file. Then I'll create four folders with mkdir: templates, jobs, kubernetes, and scripts. These are the four directories I'll be using to set up the project. jobs holds the sample jobs that we want to create and submit to our Kubernetes cluster, and the kubernetes (K8S) directory is where the Kubernetes configuration files and everything required there will sit. We have scripts and templates as well. It's time to set up our Kubernetes cluster. The way we do that is to start with the Kubernetes directory

K8S Namespaces

and create a few files, the YAML files that we'll be using. I'll create a file and call it namespace.yaml. Inside namespace.yaml, the API version is v1, the kind is Namespace, and in the metadata the name for this namespace is spark. If I want to delete or do anything else with this namespace, I just use this name, spark. To check what is currently available on your Kubernetes cluster, use kubectl get namespaces. I'm going to zoom in a little. You see default, kube-node-lease, kube-public, and kube-system, and nothing like spark on our local system. Those are the namespaces available in our Kubernetes cluster, so we can now create our spark namespace as well. The labels in this case are app.kubernetes.io/name, which is spark, and the component, which is processing. All right, that's our namespace. The next thing is to get our RBAC in, which is the role-based access control. I'm going to have rbac.yaml, and in it I'm going to have similar stuff. The API version is going to be v1.
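
The namespace manifest described in this segment has only a handful of fields. As a rough sketch, here are the same fields as a Python dictionary serialized to JSON, which `kubectl apply` accepts just like YAML. The video writes this as a YAML file; the exact label keys below (`app.kubernetes.io/name` and `app.kubernetes.io/component`) are an assumption based on what is said in the video.

```python
import json

# Same fields as the namespace file described above.
# The label keys are assumed; the video only says "name: spark" and
# "component: processing".
namespace = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "spark",
        "labels": {
            "app.kubernetes.io/name": "spark",
            "app.kubernetes.io/component": "processing",
        },
    },
}

# kubectl reads JSON manifests as well as YAML, for example:
#   python namespace.py | kubectl apply -f -
print(json.dumps(namespace, indent=2))
```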

K8s Service Accounts, RBAC

Then I'm going to have the kind as ServiceAccount. The metadata name is spark, and the namespace is spark, of course. That's my service account, and it is attached to the namespace I'm creating. This is very important because any time we submit a job, we use this service account to hold everything together. It is like a user account that you log into a platform with, and this user assumes the role-based access assigned to it.

The API version for the next object is version one as well (for a Role, that means v1 of the rbac.authorization.k8s.io API group). The kind is Role, the metadata name is spark-role, and the namespace is spark as well. Now we're left with the rules attached to this role. Think of it like this: you have a user and you have a role. What can the role attached to a particular user do? Think of two people, one with admin access and the other with user access. The admin will of course have many more permissions. What we're doing here is creating a simple role that we attach to our spark service account, which is in turn attached to the namespace, so everything is glued together. The API groups go first, and then the resources required in this case are pods, pods/log, services, configmaps, and I'm going to add persistentvolumeclaims. Should we add anything else? I think that's fine. My autocompletion is going crazy. For verbs we have create, get, list, watch, delete, and then deletecollection, update, and patch. That seems to be all I need.

Then I have another API groups entry. The other permission I want to assign to this role is pod execution. We want to be able to exec into the pods that Spark creates on Kubernetes. If you submit, say, two or three jobs, we spin up separate pods for each submission, so we can easily track the end-to-end life cycle of each one. The resource in this case is pods/exec, and the verbs are create and get, so we create and then we retrieve the logs from that pod. That's the role.

Now let's do a role binding for it. The API version is the same, the kind is RoleBinding, and I'm going to zoom in a little. In the metadata, the name is spark-role-binding and the namespace is spark. Then I have my subject, which is the service account I'm binding, and the role reference. The role reference has kind Role, the name spark-role, and the API group rbac.authorization.k8s.io. That's all I need for role-based access.

Before we continue, let's apply this. The simplest way is to point kubectl at your kubernetes directory: kubectl apply, then -f, which specifies that you want to apply a file to your Kubernetes cluster, then kubernetes/namespace.yaml. Another way is to copy and paste your configuration into your terminal, but that's not clean and it's not reproducible. You want to stick to something you can easily change and reapply multiple times. All right.
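
The three RBAC objects described above (service account, role, role binding) have to agree on names for the binding to take effect. The sketch below writes them as Python dictionaries and bundles them into a single `List` document that `kubectl apply -f -` can read. It is an approximation of the YAML file in the video, not a copy: the hyphenated names `spark-role` and `spark-role-binding` are inferred from the spoken names, and the verb list follows what is read out.

```python
import json

NAMESPACE = "spark"
RBAC_GROUP = "rbac.authorization.k8s.io"

service_account = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": "spark", "namespace": NAMESPACE},
}

role = {
    "apiVersion": f"{RBAC_GROUP}/v1",
    "kind": "Role",
    "metadata": {"name": "spark-role", "namespace": NAMESPACE},  # name assumed
    "rules": [
        {   # what the driver needs to create and clean up executors
            "apiGroups": [""],  # "" is the core API group
            "resources": ["pods", "pods/log", "services", "configmaps",
                          "persistentvolumeclaims"],
            "verbs": ["create", "get", "list", "watch", "delete",
                      "deletecollection", "update", "patch"],
        },
        {   # exec into pods
            "apiGroups": [""],
            "resources": ["pods/exec"],
            "verbs": ["create", "get"],
        },
    ],
}

role_binding = {
    "apiVersion": f"{RBAC_GROUP}/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "spark-role-binding", "namespace": NAMESPACE},  # name assumed
    "subjects": [{
        "kind": "ServiceAccount",
        "name": service_account["metadata"]["name"],
        "namespace": NAMESPACE,
    }],
    "roleRef": {
        "kind": "Role",
        "name": role["metadata"]["name"],
        "apiGroup": RBAC_GROUP,
    },
}

# A v1 List lets one document carry all three objects.
manifest = {"apiVersion": "v1", "kind": "List",
            "items": [service_account, role, role_binding]}
print(json.dumps(manifest, indent=2))
```

Deriving the subject and `roleRef` names from the other two objects, rather than retyping them, avoids the most common RBAC mistake: a binding that points at a role or account that does not exist.
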
So in here you have your namespace. If I run kubectl get namespaces, you can see that I now have spark, which became active about 6 seconds ago. Let's apply our role-based access as well. Now the service account is created, spark-role is created, and spark-role-binding is created, which means our configuration is valid. Okay, let's continue. The third thing we want to do is create the Spark jobs that we'll be sending to our Kubernetes cluster. Inside my jobs directory, I'll create a simple counter job for our Kubernetes submission. I'll create a Python file and call it simple_counter.py. Of course, you need to have PySpark installed, and we don't have it yet, so from pyspark.sql import SparkSession does not resolve. I'm just going to do a pip install. In fact, no, let's create a requirements file inside of our templates directory. This is where

Creating Spark Jobs for K8S

we have a dashboard visualizing everything. So let's create a requirements.txt in here, so we can install everything once and for all. In here I'm going to have Flask, which is going to be 3.0.0, or we just use 3.1.2. Then we have kubernetes, 28.1 in this case, and gunicorn 21.2.0. Good. Now I just do pip install -r and point it at templates/requirements.txt. Good. Now I have kubernetes properly installed, and PySpark is installed of course. The squiggly underline should go away after indexing. If it does not, you need to apply your interpreter settings: go to the local interpreter, select the existing one, which is my Spark Kubernetes environment, and apply. It should go away, but it's not. That's fine. We know our syntax, so it should be okay.

All right. I'm going to import time here and define main. I'll print "Spark simple counter" just to decorate the output. Then I create the Spark session with the app name simple counter and getOrCreate, as simple as that. I take the Spark context from the session and set the log level to WARN, so it is not too verbose and you can easily read what is going on. You can use DEBUG if you want to go deeper into the logs with every line item printed, but WARN is fine, or even ERROR. Then I print the Spark version, the application ID, and the master this is running on. Next is my start time, which is why I wanted to import time earlier: the start time is time.time(). Then the number of elements, num_elements, let's say 1 million, and the number of partitions I want to split this into, let's say 10.

The next thing is to create an RDD with the elements spread across the partitions: we call parallelize with the number of partitions, and close the bracket. Then let's perform the computation. We count the elements, compute the total sum, the average, the minimum, and the maximum, and print them. If you want the odd and even numbers, we can count those as well. We are in the AI era now, so most of the job is just pressing the Tab key, and sometimes you let the AI write everything. Next is a DataFrame example, if you want to perform some aggregations. df is spark.createDataFrame, and inside it I'll have each number i, i * 2, and whether it is even or odd. For the schema I start with number, doubled, and parity. Nope, let's change this: I'll have number, doubled, and type. In fact, I don't need a full schema; just the column names will do. Then I print the sample data. Good. Let's aggregate by type: print "aggregation by type", group by type, and run some aggregations on top of that. Let's break it down: we have the count, the number column is summed rather than counted, and doubled is averaged with avg, and I show the result. Then I compute the elapsed time, print it, print "Spark counter completed successfully", and call spark.stop(). Finally, if __name__ is __main__, call main, and that's all. So that is our simple counter job.
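
The job described above can be reconstructed roughly as follows. This is a sketch based on the narration, not the author's file: the element range `1..n`, the 100-row DataFrame, the column aliases, and the `expected_stats` helper (a pure-Python cross-check that needs no Spark) are assumptions added for illustration. PySpark is imported inside the functions so the helper can be used on a machine without Spark installed.

```python
import time


def expected_stats(n):
    """Closed-form statistics for the integers 1..n (pure Python, no Spark)."""
    total = n * (n + 1) // 2
    return {"count": n, "sum": total, "mean": total / n,
            "min": 1, "max": n, "even": n // 2, "odd": n - n // 2}


def run_counter(spark, num_elements=1_000_000, num_partitions=10):
    """Compute the same statistics with an RDD, then aggregate a small DataFrame."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    sc = spark.sparkContext
    rdd = sc.parallelize(range(1, num_elements + 1), num_partitions)
    stats = {
        "count": rdd.count(),
        "sum": rdd.sum(),
        "mean": rdd.mean(),
        "min": rdd.min(),
        "max": rdd.max(),
        "even": rdd.filter(lambda x: x % 2 == 0).count(),
    }
    stats["odd"] = stats["count"] - stats["even"]

    # Column names only; Spark infers the types.
    rows = [(i, i * 2, "even" if i % 2 == 0 else "odd") for i in range(1, 101)]
    df = spark.createDataFrame(rows, ["number", "doubled", "type"])
    df.show(5)
    (df.groupBy("type")
       .agg(F.count("number").alias("count"),
            F.sum("number").alias("sum_number"),
            F.avg("doubled").alias("avg_doubled"))
       .show())
    return stats


def main():
    from pyspark.sql import SparkSession  # lazy import; requires PySpark

    spark = SparkSession.builder.appName("SimpleCounter").getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")
    print(f"Spark {spark.version}, app {sc.applicationId}, master {sc.master}")

    start = time.time()
    stats = run_counter(spark)
    print(stats)
    print(f"Spark counter completed successfully in {time.time() - start:.2f}s")
    spark.stop()
```

When saving this as the job file, add the usual `if __name__ == "__main__": main()` guard at the bottom, as the video does. It is left out here so the module can be imported without starting a Spark session. Comparing `run_counter` against `expected_stats(num_elements)` is a cheap way to confirm that the cluster returned the right numbers.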
All right, the squiggly underline is gone now. Now that we have our simple counter in place, the next thing is to get the Dockerfile ready, which we'll use to build the image for submitting jobs to our cluster. We start with the FROM line: we have Apache Spark 3.5.3. You can use any other version you prefer, but 3.5.3 is okay. Then I install PySpark. Then I run mkdir for /opt/spark/jobs and set it to 755, which makes it readable and executable by everyone and writable by the owner. Then I copy the jobs from my jobs directory, which contains the simple counter job, into /opt/spark/jobs, and set the permissions to 755 again. I already did that, but it doesn't carry over to the files you copy in. I don't think this is even necessary anymore. The user is spark, and the work directory is under /opt, so that is the current directory you land in, and you can submit any job directly from there.

We only have two things left to do: set up our Spark History Server, where all our running and completed jobs will show up, and set up the dashboard that we'll use as a control plane for managing running executions, that is, cancelling, deleting, and submitting jobs to the Spark cluster, instead of having to run commands manually on our local system every single time. Let's continue with that.

So our Dockerfile is ready: we start from the Apache Spark image, switch to root, install the requirements, copy the Spark jobs into the jobs directory, and that's it. Everything else that we need to run will be taken care of directly from the UI, so you don't have to run it manually. I'll show you how you can run it manually as well in case you decide to do that. The next thing is to set up our Spark History Server.

I'm going to go into my Kubernetes directory, create a new file, and call it spark-history-server.yaml. It doesn't matter whether you use the yaml or yml extension; both work just fine. It is only a naming convention, so pick one for uniformity's sake. All

k8s Spark History Server

right. I'm going to have the API version, which is v1. Just to put this into perspective, this is the persistent volume claim for my Spark History Server, for my event logs. The kind is PersistentVolumeClaim. The metadata name is spark-history-pvc, and the namespace is spark, as you're already aware. The spec holds the properties attached to this PVC: the access mode is ReadWriteOnce, and then we have resources. If you want an additional access mode, say ReadWriteMany, you can add that as well, but in my case this suffices. My request is 1 gigabyte of storage. On a standard Kubernetes cluster, this is handled by your deployment team, so you don't have to do it manually. The only thing you need to do is find a way to get your Spark operator in there and start submitting. But let's continue.

Next is the Spark History Server deployment. The API version is apps/v1, the kind is Deployment, the metadata name is spark-history-server, the namespace is as before, and then we have the labels, spark-history-server. Let me put this back one step. The spec attached to this deployment: replicas is one. Let me move this back a little. Our selector is matchLabels with the history server app label. The template is fine. Our pod spec has containers, with the name and the image; let's indent the image a little further, and the same with the command. The command runs the start-history-server.sh script from the Spark installation under /opt. Our env is like that, with SPARK_HISTORY_OPTS.

Let me see the value. We have the log directory, /tmp/spark-events. That's fine. The port is 18080. Okay, move this one step out and we should be good to go. Our container port is declared there. Then the volume mount path, and resources with memory and CPU. Let me confirm the volume mount: yes, under /tmp. That's fine. My resource request is 512 MiB of memory and 250 millicores of CPU. Then I have my limits. My readiness probe is an HTTP GET. It has a path, which is just the forward slash, and the port is 18080, to make sure the server is healthy and able to accept connections. The period is 5 seconds. The initial delay seconds is fine. A period of 10 seconds is okay too; we can reduce it to five if you want. Aside from readiness we have the liveness probe as well, just to make sure it's live. It is the same: the port is 18080 and the path is the forward slash. The initial delay is maybe 20 or 30 seconds, and the period is 20 seconds. Maybe let's increase this a little. Okay.

There's one other thing we need to add: volumes. This is our spark-events volume, which connects the volume mount to our persistent volume claim, spark-history-pvc.

The last thing we need to do, according to my notes, is get the Spark History service into the picture. The API version is v1 and the kind is Service. We have the metadata, spark-history-service, the namespace is fine, and the labels are okay, with the app being spark-history-server. Let's get our spec in. The type is NodePort. The port is 18080.
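
The claim and the deployment described so far can be sketched as Python dictionaries to show how the pieces reference each other: the volume points at the claim, the mount points at the volume, and the selector must match the pod template labels. Several values are assumptions rather than details confirmed in the video: the `apache/spark:3.5.3` image, the script path `/opt/spark/sbin/start-history-server.sh`, the exact probe timings, and the `SPARK_NO_DAEMONIZE` variable (added here so the script stays in the foreground inside a container). The helper `selector_matches_template` is hypothetical.

```python
import json

NAMESPACE = "spark"
APP_LABELS = {"app": "spark-history-server"}
EVENT_DIR = "/tmp/spark-events"
UI_PORT = 18080


def http_probe(initial_delay, period):
    """HTTP GET probe against the History Server UI root."""
    return {"httpGet": {"path": "/", "port": UI_PORT},
            "initialDelaySeconds": initial_delay,
            "periodSeconds": period}


pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "spark-history-pvc", "namespace": NAMESPACE},
    "spec": {"accessModes": ["ReadWriteOnce"],
             "resources": {"requests": {"storage": "1Gi"}}},
}

container = {
    "name": "spark-history-server",
    "image": "apache/spark:3.5.3",                            # assumed image
    "command": ["/opt/spark/sbin/start-history-server.sh"],   # assumed path
    "env": [
        {"name": "SPARK_HISTORY_OPTS",
         "value": f"-Dspark.history.fs.logDirectory=file:{EVENT_DIR} "
                  f"-Dspark.history.ui.port={UI_PORT}"},
        {"name": "SPARK_NO_DAEMONIZE", "value": "true"},      # assumption, not in the video
    ],
    "ports": [{"name": "http", "containerPort": UI_PORT}],
    "volumeMounts": [{"name": "spark-events", "mountPath": EVENT_DIR}],
    "resources": {"requests": {"memory": "512Mi", "cpu": "250m"}},
    "readinessProbe": http_probe(initial_delay=10, period=5),   # timings assumed
    "livenessProbe": http_probe(initial_delay=30, period=20),   # timings assumed
}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "spark-history-server", "namespace": NAMESPACE,
                 "labels": APP_LABELS},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": APP_LABELS},  # labels go under matchLabels
        "template": {
            "metadata": {"labels": APP_LABELS},
            "spec": {
                "containers": [container],
                "volumes": [{
                    "name": "spark-events",
                    "persistentVolumeClaim": {"claimName": pvc["metadata"]["name"]},
                }],
            },
        },
    },
}


def selector_matches_template(dep):
    """True if every selector label is present on the pod template."""
    selector = dep["spec"]["selector"]["matchLabels"]
    labels = dep["spec"]["template"]["metadata"]["labels"]
    return all(labels.get(key) == value for key, value in selector.items())


print(json.dumps({"apiVersion": "v1", "kind": "List",
                  "items": [pvc, deployment]}, indent=2))
```
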
The target port and the node port are set as well, and the port name is http. For the selector, let me move this back: it is spark-history-server. Good. I think we are ready to go with our Spark History Server, so I can save this and apply it with kubectl apply. In my case, I'll just apply the spark-history-server file. It says our PVC is created and the Spark History service is fine, but we got an error when trying to create the Spark History Server deployment. The error says the object cannot be handled as a Deployment: a strict decoding error on spec.selector.app. So let's look at the spec in our deployment, at the selector. I indent this one level, so that app sits under matchLabels. Okay, good. Now our Spark History Server is fine, everything is deployed, the PVC is correctly applied, and the history service is okay. This is what the template looks like in the end. We can apply it again and we're good to go. Even if you apply it multiple times, nothing changes in the cluster unless you change something in the config.

The last thing we need to do now is get our dashboard into the picture. With the Spark History Server in place, you won't see much yet. You only see one pod currently running, which is our Spark History Server, and the service is the Spark History Server service as well. You can check the container for more information. You should be able to see something like this. All right, let's go back. Now that our Spark History Server is all right, let's set up our dashboard. We need a dashboard YAML, which describes where we'll be hosting and running our dashboard. I'm going to create dashboard.yaml, and it will be similar to what we did for the Spark History Server. Our API

Spark Control Dashboard

version is going to be v1. The kind is ServiceAccount. The metadata name is spark-dashboard and the namespace is spark. That's our service account. Next is the API version for role-based access: the kind is Role, the metadata name is spark-dashboard-role, and the namespace is spark. Then we have our rules. The first rule covers pods and pods/log with get, list, watch, and delete. We have another rule for configmaps with get, list, watch, and delete, and create as well. Then services, which is similar, so we have similar verbs across all of them. Then we have the batch API group, which covers jobs, with the same verbs. That's it. Next is the role binding, version one again and similar to before: spark-dashboard-role-binding, namespace spark. The subject is a service account, spark-dashboard, in the spark namespace. The role reference has kind Role, the name spark-dashboard-role, and the API group as before.

Then we have the next API version, kind Deployment, metadata name spark-dashboard, namespace spark, and the labels for the app. Next is our spec: replicas is one, and the selector matches the dashboard label. The template is fine: in its metadata we have the app label. For the pod spec we have the service account and the containers. The container is named dashboard and the image is spark-dashboard:latest, which is what we're going to build locally. If you already have an image on the cloud, maybe on ECR or Azure Container Registry, you can use that as well. The image pull policy is Never, because we're not pulling from an outside registry. For the port, I'm going to put 5000, and the name is http.

Then we have our env. Instead of a Spark master, I'll put the Spark namespace variable, with the value spark. And we have our jobs directory variable, which points to the app's jobs directory. This is where the jobs will sit in our Kubernetes cluster. Finally, we have our resources. Our request is 128 MiB of memory, not 256, and 100 millicores of CPU is fine. The limit for our deployment is 256 MiB, not 512. This is because I already tried this out, and giving the dashboard more memory sometimes reduces the amount of memory left on my system, and things get slow. That's why I'm reducing the memory I attach to it. For the CPU limit, let's use 200 millicores. Then our readiness probe: HTTP GET, just like before. The path is the forward slash and the port is 5000, and then we have the initial delay seconds and period seconds. The liveness probe is similar: HTTP GET, the same path, port 5000, and an initial delay of 30 seconds. 10 seconds is fine as well. And that's all we need for our deployment. So our deployment is okay.

Now let's get the final thing, the service, into the picture. API version v1, kind Service, metadata name spark-dashboard, the namespace is fine, the label is spark-dashboard, and the spec is okay. The type is NodePort. The port we want to listen on is 5000, the target port is 5000, and the node port is 30050. The protocol is HTTP. We don't need the protocol, do we? So the name is http, and the final thing is the selector, app spark-dashboard. That's all. I know this is a lot, and I had to get my notes to guide me as well.
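
The dashboard deployment and its NodePort service can be sketched the same way. This is an illustration under assumptions, not the file from the video: the environment variable names `SPARK_NAMESPACE` and `JOBS_DIR`, the `/app/jobs` path, and the probe timings are guesses from the narration, while the image name, pull policy, ports, and resource figures follow what is said.

```python
import json

NAMESPACE = "spark"
APP_LABELS = {"app": "spark-dashboard"}
PORT = 5000
NODE_PORT = 30050  # must fall in the default NodePort range 30000-32767


def http_probe(initial_delay, period):
    """HTTP GET probe against the dashboard root path."""
    return {"httpGet": {"path": "/", "port": PORT},
            "initialDelaySeconds": initial_delay,
            "periodSeconds": period}


container = {
    "name": "dashboard",
    "image": "spark-dashboard:latest",
    "imagePullPolicy": "Never",          # use the locally built image only
    "ports": [{"name": "http", "containerPort": PORT}],
    "env": [
        {"name": "SPARK_NAMESPACE", "value": NAMESPACE},   # variable name assumed
        {"name": "JOBS_DIR", "value": "/app/jobs"},        # name and path assumed
    ],
    "resources": {
        "requests": {"memory": "128Mi", "cpu": "100m"},
        "limits": {"memory": "256Mi", "cpu": "200m"},
    },
    "readinessProbe": http_probe(initial_delay=10, period=5),   # timings assumed
    "livenessProbe": http_probe(initial_delay=30, period=10),   # timings assumed
}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "spark-dashboard", "namespace": NAMESPACE,
                 "labels": APP_LABELS},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": APP_LABELS},
        "template": {
            "metadata": {"labels": APP_LABELS},
            "spec": {"serviceAccountName": "spark-dashboard",
                     "containers": [container]},
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "spark-dashboard", "namespace": NAMESPACE,
                 "labels": APP_LABELS},
    "spec": {
        "type": "NodePort",
        "selector": APP_LABELS,   # routes traffic to the dashboard pods
        "ports": [{"name": "http", "port": PORT,
                   "targetPort": PORT, "nodePort": NODE_PORT}],
    },
}

print(json.dumps({"apiVersion": "v1", "kind": "List",
                  "items": [deployment, service]}, indent=2))
```

The `protocol` field is omitted on the service port, matching the decision in the video; Kubernetes defaults it to TCP, and HTTP is not a valid value there.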
So what you need to do is make sure that you have a service account, the role, the role binding, the deployment and the service. That's all you need. It cascades from the top down, so make sure that you start with the service account, then the role, the role binding, the deployment and the service. Now that our dashboard manifest is here, we could actually apply it, but first we need to have an image, because we are saying we are pulling spark-dashboard:latest. So we need to make sure that our spark-dashboard image is available, and we can use that when we set up our dashboard implementation. So let's get to our templates. Let's put this inside of dashboard; let's call the folder dashboard so we are clear as to what this looks like. Inside of dashboard we have a templates folder, which is going to be the way the UI is structured. We have requirements.txt in there as well. And I'm going to have my Python file, which is going to be an API, a Flask server that interfaces with all the functions I want to use against Kubernetes. Think of retrieving my pods from Kubernetes, reading their logs, starting and stopping a particular pod or service, or deploying a particular job on Kubernetes. So the Flask server is going to be my API layer on top of the manual terminal commands that you would otherwise write on your local system. The reason is that those become cumbersome, because you have to remember every single command. With this UI you can just upload your Python file into the dashboard, and the dashboard submits it to the Kubernetes cluster running on your local system. So you don't have to remember everything off the top of your head when submitting the job.
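To give a feel for the kind of command the dashboard saves you from remembering, here is a sketch of a helper that assembles a spark-submit invocation for Kubernetes cluster mode. The video does not show how the dashboard actually submits jobs, so this is only one plausible shape: the function name, the image name, the service account name and the local:///app/jobs path are hypothetical, while the --conf keys are standard Spark-on-Kubernetes settings.

```python
import os


def build_spark_submit(job_file, image, namespace="spark", executors=2,
                       service_account="spark",
                       api_server="https://kubernetes.default.svc"):
    """Return the argv list for submitting a PySpark file in cluster mode."""
    # Use hyphens instead of underscores, since the app name ends up in pod names.
    app_name = os.path.splitext(os.path.basename(job_file))[0].replace("_", "-")
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName="
                  f"{service_account}",
        "--conf", f"spark.executor.instances={executors}",
        # local:// means the file is already inside the container image.
        f"local:///app/jobs/{os.path.basename(job_file)}",
    ]


if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    print(" ".join(build_spark_submit("simple_counter.py", "spark-jobs:latest")))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without a shell.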
And that's why I'll name that app.py. I'll just copy

k8s API layer

that from my notes and paste it in here. So essentially what I have is a Flask server. You don't necessarily need this. We have an app name. We load the Kubernetes config, which is set up inside of load_incluster_config, and if there's nothing there, we use the kubeconfig. That's all we need to do in here. We set up our v1 client, which is the interface between our Kubernetes cluster and our local system. Our Spark namespace is spark, which, if you can still recall, is the namespace we are using, and the jobs directory is /app/jobs. Now we have a couple of functions in here. One says get the jobs, and this is going to get me all the pods in the spark namespace. The same thing in my spark namespace; literally this is duplicated, so I don't suppose I need both. Just one is fine. All right. Okay, so I get my uploaded jobs, get my pod logs, then the API configuration. In this case I have my forward slash, which is the landing page; it just renders an HTML template. I have my API status, to keep the service running and make sure Kubernetes always knows this particular service is available. And this is just going to try to get the... I see. So I have my get jobs and get pods. Why did I think these were the same before? It's just J and P, and my brain thinks they're the same. All right. So inside of my API status I get my jobs and I get my pods. This is where I do the data extraction from Kubernetes. I get the logs. I have the upload, for when you want to upload a particular job into the Kubernetes cluster; then you get to run this particular job via the UI, delete it, and that's it. So it's just basic CRUD, but against your Kubernetes cluster. All right, so that is done. I'll get my final template in here, which is index.html. I'll just paste it in here. So this is what index.html looks like.
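The config-loading fallback and the pod lookup described above could look roughly like the sketch below, using the official kubernetes Python client. The function names are hypothetical and the real app.py may be organised differently; the Flask routes are left out so that the counting logic can be tried without a cluster.

```python
from collections import Counter

SPARK_NAMESPACE = "spark"


def load_core_v1():
    """Return a CoreV1Api client, preferring in-cluster credentials."""
    # Third-party dependency: pip install kubernetes
    from kubernetes import client, config

    try:
        # Works when running inside a pod with a mounted service account.
        config.load_incluster_config()
    except config.ConfigException:
        # Fall back to the local kubeconfig when run from a laptop.
        config.load_kube_config()
    return client.CoreV1Api()


def get_pod_phases(v1, namespace=SPARK_NAMESPACE):
    """List the phase of every pod in the namespace."""
    pods = v1.list_namespaced_pod(namespace=namespace)
    return [pod.status.phase for pod in pods.items]


def count_phases(phases):
    """Summarise pod phases into the four counters a status page shows."""
    counts = Counter(phases)
    return {
        "running": counts.get("Running", 0),
        "completed": counts.get("Succeeded", 0),
        "pending": counts.get("Pending", 0),
        "failed": counts.get("Failed", 0),
    }


def get_pod_logs(v1, name, namespace=SPARK_NAMESPACE):
    """Fetch the logs of a single pod, for example a Spark driver."""
    return v1.read_namespaced_pod_log(name=name, namespace=namespace)
```

A status endpoint would then return something like count_phases(get_pod_phases(load_core_v1())) as JSON.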
You'll see it once I run this as we move on. Inside of my dashboard folder is where the requirements.txt will be. And that's all I need to do. So I just save this, and then I'll do kubectl apply -f, pointing into the kubernetes folder, and I'm going to be applying my dashboard. Okay. And you can see that my dashboard is created. But the only thing is we need to build the image for the dashboard container, not the Spark job. We need to get our Dockerfile into the picture so we can build an image. It's just a simple installation. We do FROM python:3.11-slim. We set the work directory. Copy requirements.txt. We run a pip install. Copy the application in, then the templates. Then our ENV is going to be app.py and the Spark namespace, expose 5000, and run gunicorn. So just run it on 0.0.0.0:5000, and you can decide the number of workers and threads, and then the app. That's all. So we just do a build. I'll say cd dashboard and then docker build. I'll tag it spark-dashboard:latest and use my current directory as the context. And that's going to build the image for me. There's one tiny challenge that I currently have: I already applied the dashboard before building the image, and you can recall that our dashboard deployment uses this spark-dashboard:latest image that we didn't have yet. So now that we've built this image and it's on our local system, we need to restart the deployment so it picks up the new image. We just do kubectl rollout restart deployment, in this case spark-dashboard, with the namespace. So I'll restart the deployment so it can pick up the image from there, and let's check out the OrbStack dashboard.
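The build-then-restart sequence above is easy to get in the wrong order, so here is a small sketch that spells it out as argv lists and only prints them unless asked to run. The helper names are hypothetical, and the final kubectl rollout status step is an addition for convenience that is not in the video.

```python
import shlex
import subprocess


def rollout_commands(image="spark-dashboard:latest", deployment="spark-dashboard",
                     namespace="spark", context_dir="."):
    """Commands to rebuild the local image and make the Deployment use it."""
    target = f"deployment/{deployment}"
    return [
        # 1. Build the image locally (imagePullPolicy is Never, so no push).
        ["docker", "build", "-t", image, context_dir],
        # 2. Recreate the pods so they start from the freshly built image.
        ["kubectl", "rollout", "restart", target, "-n", namespace],
        # 3. Optional: block until the new pods are ready.
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(rollout_commands())  # dry run: prints the three commands
```

Run it from inside the dashboard folder with dry_run=False once the printed commands look right.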
So I have my spark-dashboard, which is running, 12 seconds ago and 3 minutes ago. This one is completed and will be removed, and this is the latest one, 17 seconds ago. I have my Spark History Server, and if you check my containers in here, I have a couple of containers created on my Kubernetes cluster that we can now leverage. So I'm going to open this locally. If you look at my spark-dashboard and go to the info, where is it, spark-dashboard. So this is the pod that is running it, and this is my spark-dashboard. All right. So what is the port that we said spark-dashboard will be on? The port is 30050, isn't it? I think it's 5000. So I'm just going to move this here. I'll say localhost 5000. Yeah, it's 30050, since that's our node port.

Spark Dashboard, Job submissions and review

30050. So this is our Spark Kubernetes dashboard. Instead of accessing this directly on the target port, we are exposing our node port, not the container port, locally. So this is our interface with the world, which is 30050. And on my Kubernetes dashboard I have my real-time dashboard. I currently have two running pods. Nothing is completed, nothing is pending and nothing has failed. So what we need to do is drop our Spark job in here. I can just go into my files. Let me pull that up. In here I have my simple counter, which is my simple counter job, and I'll just upload it. This is going to get created on my Spark cluster and run this particular job. So the driver is already picking this up and it's going to submit it to the executors. The executors run it and report back to the driver, which returns the result at the end of the day. So you can see what this looks like. The executor is picking this up now to run. It's running two executors at the same time, 81 and 82. If there's a need for multiple executors, you would have multiple pods running together to get this solved. At the end of the day, whatever the execution result looks like is returned back to us. So let's wait a few more seconds for this to be done, and we should see these executors deleted. And yeah, this is almost done now, and you can see in my driver I have a result returned to me. Now if you look at the aggregations that we just did: we said we have 1 million elements across 10 partitions, that's like 10,000 each, 100,000 each rather. Then the sum is this. The average is this. The minimum and the maximum are this. Then we get the odd and even numbers here. I mean, this is just a demo job, so even if the logic is wrong or whatever, it doesn't really matter.
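The partition arithmetic mentioned above (1 million elements over 10 partitions is 100,000 each) can be checked with a single-process Python sketch that mimics how partial results per partition are combined. It is not the PySpark job from the video, and the assumption that the numbers run from 1 to 1,000,000 is mine.

```python
def partition_ranges(n_elements, n_partitions, first=1):
    """Split first..first+n_elements-1 into contiguous, near-equal partitions."""
    size, remainder = divmod(n_elements, n_partitions)
    ranges, start = [], first
    for i in range(n_partitions):
        stop = start + size + (1 if i < remainder else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges


def partial_stats(part):
    """What one executor task would compute for its own partition."""
    return {
        "sum": sum(part),
        "count": len(part),
        "min": min(part),
        "max": max(part),
        "even": sum(1 for x in part if x % 2 == 0),
    }


def combine(partials):
    """What the driver does with the partial results."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    even = sum(p["even"] for p in partials)
    return {
        "sum": total,
        "count": count,
        "avg": total / count,
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
        "even": even,
        "odd": count - even,
    }


if __name__ == "__main__":
    parts = partition_ranges(1_000_000, 10)
    print([len(p) for p in parts])  # ten partitions of 100,000 elements
    print(combine([partial_stats(p) for p in parts]))
```

Only the small per-partition dictionaries travel back to the driver, which is why this kind of aggregation scales with the number of executors.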
What matters is that we see the logic being submitted into a Spark cluster and our executors being spun up to run the tasks. And you can see what this looks like. We are performing the aggregation. For the odd numbers we did a show of the first five rows, which is fine, our aggregation by type is even and odd numbers, and then our Spark job completed in 26 seconds. So essentially this is what it looks like. And if you want to see what happens in our History Server, let's see. Our History Server doesn't seem to have started, but it should be started. Before we look at that, let's get another sample job in. I have a word count job in here that we can test with. I'm just going to open this in Finder and upload it. So in here, upload. And the word count also gets picked up and then gets executed. Now let's see why our History Server is not coming up. So 32080. It looks like I did a wrong port binding in here. localhost... yeah, this is our Spark History Server, which is currently on 32080. So we just need to update the History Server link in the HTML, which has 30080. All right, I'll change this to 32080, and I can roll out a restart for my Spark dashboard, but I'll wait for the execution that is currently running to be done. Not that it affects anything: the job is already submitted to our Kubernetes cluster, so it doesn't really matter. So we have two pods that are currently running, and each of them will return its result to the driver once it is done. And you can see what this looks like. We have some text in this case, something like Spark and Kubernetes. And you can see in my word count this is what my sample text looks like. I just copied this from Grok, which generated the sample text, and the job just does the word count for me.
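For readers who have not seen a word count before, here is a single-process Python sketch of what such a job computes. The classic PySpark version does the same thing with flatMap and reduceByKey spread across executors; the tokenising rule and the sample sentences below are made up and are not the text used in the video.

```python
import re
from collections import Counter


def word_count(lines):
    """Lower-case each line, split it into words, and count occurrences."""
    words = (
        word
        for line in lines
        for word in re.findall(r"[a-z0-9']+", line.lower())
    )
    return Counter(words)


if __name__ == "__main__":
    sample = [
        "Spark runs on Kubernetes",
        "Kubernetes schedules Spark executors",
        "spark scales",
    ]
    for word, n in word_count(sample).most_common(3):
        print(word, n)
```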
And you can see spark appears 15 times, and the rest like that. Then the Spark-related stats, the counts, the mean and all that, and it completed in 37 seconds. Now one thing you should also understand is something called speedup. Running a simple word count app or simple counter app wouldn't give you much benefit if you're running it on a Kubernetes cluster, because the overhead is too much compared to the job itself. It's like using a machine gun to kill a mosquito; it's just not proportional. So in a case where you want to use something like Kubernetes, you want a job that is so big or so complex that a single system cannot solve the problem in time. That's when you leverage something like Kubernetes, and the speedup will be really great, because you have a much bigger job relative to the operational overhead. Otherwise your performance is going to plateau early instead of having a high curve at the end of the day. You can read up on speedup if you want; it gets interesting as to the kind of job that you can submit to your Kubernetes cluster and how your Kubernetes cluster can help you process most of your Spark jobs. And not only Spark: you can run something like Flink jobs, or any other operators that you decide to use. But essentially that is all there is for our Kubernetes cluster. So if you want to redeploy and restart our Kubernetes dashboard to reflect the

Outro

latest port that we put into our configuration file, you would see something like this. If you check your pods, you have the pods getting created and restarted. So this is 18 seconds ago for our dashboard. That is applied now and you can refresh this. If you refresh this page, you should have the latest version. And if you click on History Server, it's still picking up the old one, but you get the point. It's just a matter of updating this History Server link in here and we're good to go. All right, so it looks like it's still not getting applied. Maybe we need to wait a few minutes, but that's it really. I'm going to be putting the link to the source code and everything in the comment section. So if you followed up to this point and you've not subscribed, don't forget to like, comment, share and subscribe, and thanks for watching. You've been very fabulous. Have a nice day and I'll see you next time. Cheers.
