# Run LLMs on Kubernetes with LLMKube

## Метаданные

- **Канал:** That DevOps Guy
- **YouTube:** https://www.youtube.com/watch?v=xdMtc8jm88Q
- **Дата:** 13.05.2026
- **Длительность:** 12:24
- **Просмотры:** 2,802

## Описание

Follow the DevOps roadmap👉🏽 https://www.instagram.com/marceldempers
My DevOps Roadmap 👉🏽 https://marceldempers.dev
Patreon 👉🏽https://patreon.com/marceldempers

Checkout the source code below 👇🏽 and follow along 🤓

Also if you want to support the channel further, become a member 😎
https://marceldempers.dev/join

Checkout "That DevOps Community" too
https://marceldempers.dev/community

Source Code 🧐
--------------------------------------------------------------
https://github.com/marcel-dempers/docker-development-youtube-series

Like and Subscribe for more :)

Follow me on socials!
Instagram | https://www.instagram.com/marceldempers
X | https://x.com/marceldempers
GitHub | https://github.com/marcel-dempers
LinkedIn | https://www.linkedin.com/in/marceldempers

Music:
Track: souKo - souKo - Parallel | is licensed under a Creative Commons Attribution licence (https://creativecommons.org/licenses/by/3.0/)
Listen: https://soundcloud.com/soukomusic/parallel

Timestamps:
00:00 Intro
00:03 llama.cpp
01:12 what is llm-kube
01:42 define models as YAML
02:19 Creating a k8s cluster
02:41 The Documentation
03:56 Installing llm-kube
04:23 Check the installation
04:45 The new CRDs
05:39 The Model
06:48 The InferenceService
07:58 Under the hood
09:33 Testing\Using our Model
10:07 OpenAI endpoint (OpenCode)
11:10 The Source Code
11:35 Outro

## Содержание

### [0:00](https://www.youtube.com/watch?v=xdMtc8jm88Q) Intro

— In a previous video, we've learned how

### [0:03](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=3s) llama.cpp

to run LLM models locally on our machine using something called llama. cpp. It's a pretty awesome command-line utility that allows us to host LLM models and serve them on an endpoint. We can also use our favorite command-line AI terminals to point to our local models. And because it's command-line, it becomes extremely portable, meaning you can run it in something like a Docker container. And if we can run it in a Docker container, that means Kubernetes cluster. That means we could potentially host models in a distributed manner, scale them, and manage them just like any other workload in our cluster. Now, this is exactly what LLM Cube does. In this video, we're going to be taking a look at what LLM Cube is, how it works, how to get it up and running, and how we can deploy a model and run its inference on top of Kubernetes. And we can even go ahead and connect it to something like Open Code. So, that's a lot to get into, so without further ado, let's go. — Now, LLM Cube is a Kubernetes operator

### [1:12](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=72s) what is llm-kube

that helps you host LLMs on your own hardware using runtimes like vLLM, llama. cpp, and more. And it is fully open source on GitHub. So, you can use your favorite kubectl to manage models declaratively. It basically automates llama. cpp. So, rather than us building Dockerfiles, creating YAML files, creating stateful sets and volume mounts, it does this all for us automatically. So, you go ahead and

### [1:42](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=102s) define models as YAML

define a model as a YAML file. You provide details such as where to find the model, its hardware requirements, as well as resource requirements. Then you define how to run the model with an inference service. So, how many replicas you want, the runtime such as llama. cpp, any auto scaling settings, how you want to expose the model via an endpoint, and this will translate to a Kubernetes service. Then you can also provide resource values for that runtime. So, once you have that running, you'll have an endpoint inside of Kubernetes as a service, so you can go ahead and query your model. So, first things first, we

### [2:19](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=139s) Creating a k8s cluster

need a Kubernetes cluster, and in this video I'd like to use a utility called kind to create a lightweight Kubernetes cluster running in a Docker container. To create my cluster, I say kind create cluster. That'll give me a one node Kubernetes cluster that we can use for testing. Now, with that cluster up and running, I can say kubectl get nodes, and we have a one node Kubernetes cluster ready to go. Now, LLM Cube also

### [2:41](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=161s) The Documentation

has great documentations with a getting started guide on how to get it up and running, the prerequisites. They also have their own CLI tool with two main methods to install the operator, either use Helm or customize. Now, I want to also em- -phasize that LLM Cube can do a lot more than just hosting a model. There are quite a few guides over here, specifically around air-gapped installs, where you have no internet, how to pass models in an offline environment. This can be particularly useful for compliant environments that are running on private networks. Also, multi-GPU sharding. So, if you have something like a MacBook or a machine with a GPU, perhaps these environments cannot run Kubernetes. There are ways to get agents installed that connect to LLM Cube, so the orchestration can still happen in a Kubernetes cluster. This means you can run it on something like a home lab and use your MacBooks powerful GPU or processor to run the model. They also have guides on GPU setups and model caching. I would highly recommend to take some time to go through these guides. In this video, we're just going to get it up and running, and I'm going to show you how to use the product, how to use llama. cpp that we're already familiar with and how to host the model and actually use it. So, installing LLM

### [3:56](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=236s) Installing llm-kube

Cube is actually very simple with Helm. I run helm repo add, add the LLM Cube Helm repository. I usually search for versions first, find the latest version, and I pin that in an environment variable. Then I simply say helm install, install it in a given namespace, and pass the version I want. So, helm repo add, helm repo update, set my chart version that I want, and then say helm install. And there we go, it's now installed. To check the installation, I just say get pods in the

### [4:23](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=263s) Check the installation

LLM Cube system namespace, and we can see we have our operator now running. This has given us two new CRDs, so I can say kubectl get CRDs, pipe that into grep, and just type LLM Cube. And we have two CRDs. One is called an inference service, and the other one is called a model. So, let's talk about these two CRDs, because this is at its

### [4:45](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=285s) The new CRDs

core how LLM Cube works. The first one is the model. The model describes what to run. So, the name of the model, where to download the model file, the hardware needs. So, you might have a model that needs a GPU, or you might have a lighter model that uses CPU only. This means that the scheduler can decide where to place this model on which node, and then use the power of Kubernetes to schedule that model correctly. The inference service describes how to run the model, what LLM runtime to use, like llama CPP, resource limits and requests. Do I want this model hosted on a load balancer with a public IP address or a private service? You can think of the inference service as the pods behind the scenes and the service to expose those pods. And those pods will be running the model. Now, I'll show you how to get the source code in a bit, but here I have a

### [5:39](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=339s) The Model

model described in a YAML file. In this example, I'm going to host the small Gemma model. We've talked about Gemma in our introduction to Gemma 4. Here we describe where to get the model from and in this example we're using hugging face, which is kind of like the Docker Hub for LLM models. Then you have various settings. So here I have the quantization I want to use for this particular model and you can have some hardware settings. Now because I'm running Gemma 2, which is a small model, I've disabled GPU. So I'm purely going to be running this on CPU. So this YAML file describes the model, its settings and its resource requirements. Deploying the model is really simple, just using our native kubectl apply commands to apply the YAML file. Jump into the terminal and paste that. That will go ahead and create the model. I can say kubectl get models and we can see our model is ready. I can also use the famous kubectl describe model and we can see the model has basically been accepted. There's no issue. The source is a remote URL. So when we create the inference, the inference pods will go ahead and fetch the model and cache them. And next up we have the inference

### [6:48](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=408s) The InferenceService

service, which basically tells LLM Cube how we want to run this model. So we create a name for the service we want to host. Here is the runtime. So we're going to be using llama. cpp, the model reference. This is the model we created in the previous step. Replicas, the number of replicas we want to run. I'm just going to run one. How I want to expose this model. So in this case I want to expose it on port 8080 via an open API compatible endpoint. So I can go ahead this up with open code. In my example, I'm just going to use cluster IP. This is because kind clusters don't have load balances and I'm just going to port forward to it to test and show you how it runs. You can then further restrict the actual pods that host the model with resource settings. Applying that is also very straightforward. Just use kubectl apply. I go ahead and run that. That will create our inference service. I can do kubectl get on that inference service and we can see that it is in a creating phase and this might take some time depending on your internet connection cuz this will go ahead and download the model. We can see that by running the kubectl describe command. We can see that it is progressing, still in a creating phase. Under the hood, what this does is

### [7:58](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=478s) Under the hood

creates a service, a new service with the name of our inference service, its settings, in this case cluster IP, and here's the port it will expose. Also, behind the scenes you'll see a pod created, one because that's what I specified, and it's in initialization phase. I can do kubectl logs on this and we can see that there's a model downloader. I can do minus C on that, grab the logs for that, and we can see that it's now downloading the model, which is in progress and this might take some time. So, after some time, if we check the logs of the init container, we can see that it's downloading the model and here it's downloaded the model successfully. The Gemma 2B model is around 4. 6 GB. This also means if we do kubectl get pods, we can see that our one pod is now up and running and we can access this model via the service endpoint that's been created. So, the inference service basically just allows us to define the resources underneath that will run the model. In this case, it's just a Kubernetes deployment with one pod and a service to expose it and it automates all this llama. cpp stuff for us and gives us an OpenAI compatible API endpoint that we can route traffic to. Now, how we use this is up to us. We can have AI agents running in our cluster accessing this Kubernetes service. We can do this privately, directly between a pod and a service. We can have something like an AI gateway, basically using things like Gateway API, where we can route traffic between different models. So, this opens up a whole world of possibilities for us. Now, in this example, I'm just going to

### [9:33](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=573s) Testing\Using our Model

keep it simple by port forwarding to that service. Use kubectl port forward on port 8080. Going to go ahead and run that in my terminal and I'm going to keep this open in the background. And with that running, I can go ahead and send a curl request to that model. This will hit that inference service with a payload. And here I'm just going to send a prompt, "What is Kubernetes in one sentence? " So I jump to the terminal, I paste that curl command, and there we go. We have a response. In this case, we got a response from our Gemma 4 2B model. Now because this exposes an

### [10:07](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=607s) OpenAI endpoint (OpenCode)

OpenAI compatible endpoint, we could connect it to something like OpenCode. Now how you use these models is entirely up to you. Running agents in a cluster, outside of the cluster, whether you want to use it for your own home lab and route your OpenCode to there, that's entirely up to you including how you want to set up the networking. But it's entirely possible to do this. Because we have that inference service, we have an endpoint that we can access. So just like any OpenAI compatible provider, we can hook this up as a provider inside of OpenCode. Set our URL and the model settings that we want to use. Now as long as the networking is set up correctly, I can switch models. And with port forwarding running, I can access that model I've just deployed. And I can ask a question. This will route the traffic via port forward to my LLM Cube inference service and route it to that one pod Gemma 4 2B model. And there we go. There's our prompt, the thinking, and the answer from our Gemma 4 2B model running on our Kubernetes cluster using LLM Cube. So if you're interested in the

### [11:10](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=670s) The Source Code

source code, all of the source material in these guides are on GitHub. The link is down in the description. For this video specifically, you can go to the root of the repo. There's an AI folder, inside there's a Kubernetes folder with an LLM Cube folder and a readme. This readme has the introduction, all of the steps we followed today including creating the kind cluster, and LLM Cube, the CRDs, and how to test it. So, hopefully this video helped you

### [11:35](https://www.youtube.com/watch?v=xdMtc8jm88Q&t=695s) Outro

understand how to run LLM models inside of a Kubernetes cluster. Be sure to check out the Llama CPP video as well, so you can learn how to run these locally on your machine. In a future video, we'll take a look at AI Gateways. How do you route traffic between these models running either inside of your cluster or in the cloud like Claude models, Gemini models, or any provider. If you like the video, be sure to like, subscribe, hit the bell, and check the link down below to follow the ultimate DevOps roadmap. And if you want to support the channel even further, hit the join button down below to become a YouTube member. And as always, thanks for watching and until next time. Peace.

---
*Источник: https://ekstraktznaniy.ru/video/51717*