# RUN AI MODELS ON k8s! #ai #llm

## Метаданные

- **Канал:** That DevOps Guy
- **YouTube:** https://www.youtube.com/watch?v=_HlwEc8rdhg
- **Дата:** 19.05.2026
- **Длительность:** 2:24
- **Просмотры:** 2,927
- **Источник:** https://ekstraktznaniy.ru/video/51715

## Транскрипт

### Segment 1 (00:00 - 02:00) []

LLM Cube allows us to run production-grade language models on our own hardware. It's basically a Kubernetes operator. This means if we have a Kubernetes cluster, in our case we could use something like kind, we can go ahead and define models using YAML. Here I have a YAML file that describes running Gemma 4, the 2B model. You provide a link on where to get the model. In this case I'm using Hugging Face, the model format, quantization settings. We can define hardware requirements as well as resource requirements. This means we can then leverage the power of Kubernetes scheduling to schedule this model onto the right hardware. This hardware could be a MacBook running a silicon processor or some kind of device with a GPU. LLM Cube will do the magic to distribute pods and use the power of Kubernetes to run these models. Once I have the model deployed, I can describe how to run this using an inference service. Think of this as the service with the pods that are running the model. I create one for my Gemma 4, the 2B model, the runtime I want to use in this case llama. cpp, the reference, which is the model we've just looked at, number of replicas, and that'll determine how the model will run. Then I can provide an endpoint how I want to expose the model, the port, and an OpenAI compatible endpoint. And here I use a service cluster IP with some resource requirements. LLM Cube will take this inference service and spin up pods and services behind the scenes to make this happen. I can then use my favorite kubectl tool to go ahead and get my model. You can see my model is ready. And after deploying the inference service, it'll go ahead and download the model, store it in a persistent volume, and mount it to pods. I can do kubectl get pods, and we can see our pod is up and running. I can use a tool like kubectl port forward to that pod. I can then use something like curl to go ahead and send a prompt to that model. And just ask, "What is Kubernetes in one sentence? " Go ahead and run that and then I'll get my answer. This opens up a whole new world for how to run LLM models, not just using Docker and running them on your machine, not using a UI, but you can orchestrate it in a platform like Kubernetes. If you want to see the full guide, we can also then take a look at LLM Gateways and start looking at how do we route traffic between all these models. If you're interested to learn more about LLM Cube, check out the full video link down below.
