When The AI Model Comes With RCE
9:03

When The AI Model Comes With RCE

Mental Outlaw 25.04.2026 42 824 просмотров 1 802 лайков

Machine-readable: Markdown · JSON API · Site index

Поделиться Telegram VK Бот
Транскрипт Скачать .md
Анализ с AI
Описание видео
In this video I discuss CVE-2026-5760 a critical remote code execution vulnerabilties that was discovered in SGLang's Jinja Templating engine. If you're running SGLang it's recommended to use sandboxed jinja templates and ensure you aren't loading malicious GGUF Models to avoid a takeover of your AI server. My merch is available at https://based.win/ Subscribe to me on Odysee.com https://odysee.com/@AlphaNerd:8 ₿💰💵💲Help Support the Channel by Donating Crypto💲💵💰₿ Monero 45F2bNHVcRzXVBsvZ5giyvKGAgm6LFhMsjUUVPTEtdgJJ5SNyxzSNUmFSBR5qCCWLpjiUjYMkmZoX9b3cChNjvxR7kvh436 Bitcoin bc1qdc32p8035ztyvtm8t97gdcyhc26jg6cte9qc8n Ethereum 0xeA4DA3F9BAb091Eb86921CA6E41712438f4E5079 Litecoin MBfrxLJMuw26hbVi2MjCVDFkkExz8rYvUF

Оглавление (2 сегментов)

Segment 1 (00:00 - 05:00)

There's a newly discovered unpatched critical vulnerability in SGlang that turns a model server into a code execution target. And the scary part is that the attack can ride in through a model file and not just do a standard HTTP request. Now, before getting into the details of the exploit, let me break down the basics of what SG lang actually does. So, you probably already know that large language models consume a tremendous amount of resources. They're one of the primary reasons why GPUs, CPUs, and RAM are so expensive and so unavailable right now. And virtually every company that offers an online service is trying to shoehorn LLMs into that service because it's a hot new thing and maybe it's going to increase their profits. But to keep the number on chart going up, these large language models have to be able to scale with their customer base and throwing more compute power at the problem is obviously too expensive. So the companies offering these AI services use inference engines like SGLANG which sits between the user and the AI model to make sure that the hardware isn't wasting time recalculating things that it has already seen. It also organizes the memory of the conversations or KV cache more effectively which eats up a significant amount of VRAM when these models are running. The end result is faster processing of inputs and generation of outputs when prompts from multiple users are being thrown at a single GPU for processing. It's the classic paradigm of optimizing software in order to get the most out of the hardware that it's running on, just applied to large language models. So, it's safe to say that a lot of companies running and hosting their own AI models are using engines like this. Not necessarily SGL lang because it is one of the newer kids on the block, but it's quickly rising in popularity and becoming a favorite for structured generation and complex agentic workflows. Even the official documentation of SG Lang says that it powers large-scale production deployments, generating trillions of tokens each day across more than 400,000 GPUs worldwide. So, this isn't just some obscure AI project. Now, unlike a lot of bugs or hacks that are reported with AI software, the issue here isn't the usual. Someone typed a weird prompt and the model misbehaved story. You know, this isn't another chat GPT Dan situation. This is a serverside template injection problem that is tied to a gguf model file carrying a malicious tokenizer chat template field. The dangerous input actually lives in the model's metadata. So, the problem starts before normal prompt handling even becomes the main issue. The vulnerable feature is the reranking path. SGL langs reranking endpoint uses a cross encoder model to rerank documents by relevant to a query. But in the vulnerable workflow, it also renders a model supply chat template. That means a component that is meant to score relevance can end up acting like a code execution bridge if the templating engine is unsafe. Now, the vulnerable code path uses Ginga 2 without sandboxing instead of a lockdown environment. Now, if you're not familiar with Ginga 2, it's a pretty popular templating engine that's written in Python used to generate dynamic HTML or other markup formats. So, it's great for rendering the different outputs that you get from these models in a visually appealing way. But when you render attacker control template content without proper sandboxing, the template expressions can break out of a harmless formatting and start reaching Python's internals. And this is why the serverside template injection can eventually lead to remote code execution. We actually have a public proof of concept for this exploit which is explained pretty plainly in the readme and it comes with a fairly straightforward Python script. So the attacker creates a GGUF model with a malicious tokenizer chat template which includes the Quen 3 reranker trigger phrase to activate the vulnerable code path in serving rerank. py. The victim running sglang has to download and load the malicious ggf model which is a binary format that's used to package lms. So, the attacker also has to conduct a little bit of a social engineering or dependency supply chain attack in order to actually get that model onto the victim server in the first place. But with how quickly businesses like to deploy models, this might actually be easier than you think because the operators of these services routinely download models from public repositories and are constantly experimenting with new optimized reranker models because of the need for efficiency with the hardware bottleneck that I explained earlier. And of course, they're oftentimes just blinded by the quick money grab. So if a hacker uploads a model called Quen 3 optimizeer. gg GGUF to a model hub like hugging face. It might not receive the amount of scrutiny that you'd expect, especially towards the chat templates, which are usually thought of to be pretty harmless. Now, in most cases, once the malicious code is loaded, the attacker actually doesn't need any further interaction with the service in order to

Segment 2 (05:00 - 09:00)

compromise it. The malicious payload sits dormant inside of the chat template. And when the service processes a request through the reranker pipeline, SG lang renders that template and that is when the execution happens. And it happens because this reranker models are expecting formatted input. The servers that are running these programs don't just pass raw text into the model. It gets formatted with that chat template defined inside the model's tokenizer config. So the exploit abuses this by embedding a malicious Ginga 2 template inside of that chat template field. So when any request hits the rerank path, SGlang reads the chat template and renders it with the unsandboxed Ginga 2 environment which executes arbitrary Python code on the server that can lead to a takeover of the host that's offering this AI service. And then from there the possibilities are endless. If you own the server that's offering the service to paying customers, you can just deploy attacks against the users of the service. You can excfiltrate sensitive data about them. You can make lateral movements into other systems that the compromised server has access to. Or you could just do a good old denial of service attack. So, since the vulnerability requires the victim to download a compromised model first, the obvious solution here is to properly vet the models that you're downloading and that you're going to use in your AI service. Like, obviously, if you download malware, there's a good chance that it can take over your system. And this is really where one of the big issues are because again, so many businesses are looking to use AI in some type of way. A lot of them want to use their own self-hosted models because Chad GBT and Claude's APIs are getting more expensive by the day. Plus, they have all of these safety rails applied to them that make it less capable at performing the task you want compared to a self-hosted AI, even if it has fewer parameters. And it also makes sense from a business standpoint if you are able to get your hands on the graphics cards because the AI boom has caused those components in particular to retain their value for a lot longer than they used to compared to back when Nvidia was just associated as a gaming company. So if you can invest in the hardware to run your own local models, you're actually building equity compared to a Chad GBT subscription that doesn't have any resale value. If your business is actually successful and you get a lot of customers, then you can scale for a lot cheaper than paying more for these remote APIs. And of course, if you're running your AI models locally in house, you have much better data security, assuming your servers aren't vulnerable to attacks like this, then you do compared to using Open AI service, which mines your data and sells it to the government. But design-wise, a lot of this could just be avoided by using sandboxed Ginga templates. The proof of concept exploit literally requires the default unsandboxed Ginga environment in order to work. Just like CVE 2024 34359 codenamed llama drama which is another critical rce flaw in the llama CPP Python package which actually has now been patched but unfortunately it doesn't look like SG lang has patched this particular vulnerability yet but that extra security step of manually replacing the default Ginga 2 environment with Ginga 2 sandbox in the framework would prevent expos exposing dangerous objects into the template context. So, make sure that your services are as secure as possible before just rushing to deployment. That's the main thing to keep in mind, especially in this vibecoded era where everybody wants to deploy quickly. If you enjoyed this video, please like and share it to hack the algorithm and check out my online store, base. win, where you can buy my awesome merch and accessories for your phone or laptop. 10% storewide discount when you pay with Monero XMR at checkout. Have a great rest of your day.

Другие видео автора — Mental Outlaw

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Дайджест Экстрактов

Лучшие методички за неделю — каждый понедельник