Host Your Own AI Code Assistant with Docker, Ollama and Continue!

Table of contents (4 segments)

Segment 1 (00:00 - 05:00)

Every time I hear about GitHub Copilot, I think to myself "hm, it would be nice to have a similar service," but one that I can host locally, with no need to send data to Microsoft. And when GitHub Copilot inevitably got caught leaking API keys and secrets from GitHub repositories, the self-hosted part pretty much became a requirement for me. Fast forward to 2024, and finally, you can self-host your own code assistant tool on commodity hardware, using Docker and Ollama to run your large language models. But why? Well, whether you like it or not, a lot of people who write code for a living use generative AI tools, like ChatGPT or GitHub Copilot, to some extent. Now personally, I'm not interested in AI writing all the code for me so that I can kick back and watch some House instead. What I want from a quote-unquote AI code assistant is more intelligent and context-aware auto-suggestions. For instance, when I code an Ansible task that has to do with files, I want it to automatically suggest things like owner, group and permissions. I would've typed those anyway, but why do that if you can have the machine do it for you? And I'm pretty sure that if you write code for a living, you have your own examples of boilerplate stuff that you'd rather have your text editor suggest to you automagically. But let's zoom out here for a second. ChatGPT, GitHub Copilot, Anthropic: all of those services need entire data centers full of beefy compute nodes to run smoothly. So can we really expect the same quality and speed from relatively cheap commodity hardware? Well, that's what I want to find out in this video, and to do that, I've prepared two hardware setups. Our first contestant is the LattePanda Sigma. It's a single board computer based on Intel's Raptor Lake architecture, and it's powered by a 12-core Intel i5-1340P. It's also got 16 gigs of DDR5 RAM, which is not upgradable.
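To make the earlier point about boilerplate concrete before we dig into the hardware: here's a hypothetical Ansible file task, where the owner, group and mode lines are exactly the kind of predictable boilerplate an assistant can fill in. The path and names are made up for illustration.

```yaml
# Hypothetical example: the owner/group/mode block at the bottom is the
# predictable part an assistant should be able to autocomplete.
- name: Create application config directory
  ansible.builtin.file:
    path: /etc/myapp        # made-up path
    state: directory
    owner: myapp
    group: myapp
    mode: "0750"
```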
At idle, the LattePanda Sigma consumes as little as 5W, and even at full load, this little computer only draws as much as 60W, but we'll also be measuring exactly how much it draws while running a large language model later on in the video. However, I wouldn't recommend this exact machine regardless of the results. It's fairly expensive for what it is, and being a single board computer, it doesn't come with a full enclosure. Instead, you should take a look at mini PCs from companies like Beelink, Topton or Minisforum. A barebones mini PC with the same kind of CPU would set you back around 300 to 400€, so that's what I'll be using for price comparison later on in the video. Our other setup is my gaming PC, based on a Ryzen 7 5800X3D, as well as AMD's top GPU at the time of making this video: the Radeon 7900 XTX with 24 gigs of VRAM. I built it last summer, and at that time, AMD was the best bang for the buck, at least for my personal needs. As you can see, I've removed the cooling shroud from the GPU and replaced it with two 120mm Noctua fans, as well as some 3D printed ducts and some well placed rubber dampeners. That way, the GPU runs way quieter, and only gets a tiny bit hotter under load. And yes... [record scratch] I literally have no idea what's going on here, so please just ignore it... And yes, the 7900 XTX is fully supported by Ollama, so we will be able to run large language models on that GPU. Now, I'm not entirely sure if an NVIDIA card would still provide better results, but the closest NVIDIA card when it comes to VRAM is the 4090, which costs more than this entire build. This PC cost me around 1500€ back in summer 2023, which, on the other hand, is quite a bit more expensive than the mini PC, obviously. And I am really looking forward to seeing how those two builds compare against each other. Now, that being said, a lot of people run Ollama on their laptop, so theoretically, you don't even need a separate device for it.
However, this is a channel about home servers and homelabs, so chances are, you've been looking for some more stuff to self-host anyway. Running Ollama on a separate device means that you can actually share it with a friend, or use it on multiple devices at the same time. So now that we've talked about the hardware, let's talk software. For the operating system, I'll be using Ubuntu Server 22.04; I'll explain why I went with Ubuntu this time, instead of Debian, later on in the video. We'll also be running our Ollama instance in Docker, as well as Open WebUI, which is an open source web UI that lets us chat with our large language models. But the main way we're gonna be interacting with our models is through our text editor, so on my main development machine, which is a Mac, I'll be using VSCode as my text editor, as well as a plugin called Continue. Now personally, 99% of the time, I use Neovim to write code. However, there actually aren't any great plugins that would provide the kind of functionality I'm talking about, that is, using a local LLM as an auto-completion on steroids, instead of just a glorified chat window inside your text editor. Now don't get me wrong, plugins like model.nvim or gen.nvim

Segment 2 (05:00 - 10:00)

also let you define custom prompts, macros, and do other cool stuff, but the only Neovim plugin I found that uses an LLM for auto-completions is llm.nvim. And unfortunately, I wasn't able to get any kind of good results out of it. The text generation was slow, even with lighter 3B models, and even after tuning parameters like the context window, the results it would generate were pretty rough, and seemed to lack newline characters entirely. Now, of course, it could purely be a skill issue on my end, but the solution I eventually went with required, like, zero fine-tuning. Continue does exactly what it says on the tin, and that is providing GitHub Copilot-style code assistance, but using your local Ollama instance instead of Microsoft's servers. Now, they do also allow you to use SaaS providers, like ChatGPT, Mistral and Anthropic, but we're obviously not gonna do that in this video. According to the developers, a Neovim extension is in the works, and that would actually be really nice for those of us who don't use VSCode or IntelliJ. Now that I've bored you to death with the software part as well, let's finally install our operating system and configure Ollama! I started off with setting up my beefy desktop machine first, to kind of get a baseline of ideal performance under the circumstances. Now, usually, at least for the last couple of years, I tend to go with Debian for my self-hosted project videos. However, it seems that ROCm, which we need to run large language models on our 7900 XTX, is not supported on Debian. I've tried to get it to work, but in general, the consensus in the community (by which I mean Reddit) seems to be to just stick to Ubuntu. So that's what I did, and after installing Ubuntu Server 22.04, I also installed the deb package for the driver installer from AMD's website, and then ran the amdgpu-install script with `--usecase=rocm`. Then, I installed Docker Engine, using the official tutorial from Docker's website.
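The driver setup described above boils down to just a couple of commands. This is a rough sketch, assuming Ubuntu 22.04 and that you've downloaded the amdgpu-install package for your release from AMD's site; the .deb filename below is a placeholder, not a real version.

```shell
# Sketch of the ROCm setup described above (Ubuntu 22.04).
# The .deb filename is a placeholder; get the real one from AMD's website.
sudo apt install ./amdgpu-install_VERSION_all.deb
sudo amdgpu-install --usecase=rocm
# Group membership is commonly needed to access /dev/kfd and /dev/dri:
sudo usermod -aG render,video "$USER"
sudo reboot
```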
I won't be going through this in detail, since it's literally just copy-pasting a few commands, and I'll leave a link to it down in the video description. To run the Ollama instance, we'll be using Docker Compose, and this is what my docker-compose file looks like. Apart from Ollama, we're also running Open WebUI, which is basically a ChatGPT-like web UI for Ollama that runs locally, lets you interact with the large language models, and also lets you download new ones right from your browser. Here, we're setting up the Open WebUI, forwarding the port 8080, and pointing the WebUI container to the Ollama endpoint. Finally, we're mounting a local folder inside the container, to make sure that our settings persist. Next, we're using the ROCm version of the official Ollama container, to make sure that it utilizes our AMD graphics card. Then, we're forwarding the port 11434, to use the Ollama endpoint with our text editor. We're also mounting a local directory inside the container, to make sure that our models persist through the `docker compose down` command. And finally, we're using the devices parameter to mount /dev/kfd and /dev/dri inside the container. These are the devices that belong to our GPU, and that's what's going to let us utilize our graphics card. And that's it! It's a really simple compose file, and I'm gonna put it down in the video description. And if you develop your own Docker containers, you should take a look at `docker scout`. It lets you scan your Docker containers for security vulnerabilities, CVEs, license violations and outdated base images, and gives you a nice overview of all the problems and potential fixes for your Docker images. It integrates with all of the popular CI/CD tools, and pulls data from multiple security trackers, so that you don't miss a high impact vulnerability. Docker Scout also lets you set policies for your images, and make sure that the images that get uploaded to your company's repo fulfil them.
For instance, you can require that all images use a non-root user, or you can set a list of approved base images that the developers are allowed to use. Scout already comes with some common sense policies out of the box, and you can customize and tighten them to your company's needs. Docker Scout is free for up to 3 repositories, and also comes with unlimited local image analysis, both for personal accounts and organizations. You can check out the pricing for Team and Business plans down in the video description. And now, let's get back to self-hosted AI. Before being able to run my docker-compose project, I also had to reboot so that Ollama would recognize my GPU, but after rebooting the machine, navigating to the compose folder, and running `docker compose up`, it kinda just worked! So I opened the browser on my MacBook, went to the server's IP plus the port 8080, and registered the admin account. And there you go, we're in. So, in order to test out the performance of our setups, I downloaded a few models usually recommended in the community for code assistance: CodeLlama in the 7B version, oobabooga_CodeBooga (yes, really) and Starcoder in the 3B version.
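For reference, the compose file described earlier, which all of this runs on, can be sketched roughly like this. The image tags, host paths and container paths are my assumptions based on the upstream defaults, so double-check them against the actual file in the video description.

```yaml
services:
  ollama:
    image: ollama/ollama:rocm        # ROCm build of the official image
    ports:
      - "11434:11434"                # Ollama API endpoint for the text editor
    volumes:
      - ./ollama:/root/.ollama       # models persist through `docker compose down`
    devices:
      - /dev/kfd                     # GPU devices, so the container can use the card
      - /dev/dri
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "8080:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434   # point the WebUI at Ollama
    volumes:
      - ./open-webui:/app/backend/data        # settings persist
    restart: unless-stopped
```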

Segment 3 (10:00 - 15:00)

I then asked Ollama to generate some Ansible code for setting up Ollama on Linux, using the CodeBooga model. Loading the model into VRAM for the first time takes a little bit of time, but once the model is loaded, the text generation is pretty snappy. And with a few extra prompts, it even produced something that didn't look like a complete hallucination, which is nice. Looking at radeontop, you can see that the model takes up almost all of the VRAM on the card, so it's definitely gonna be an interesting experience running models of that size on the LattePanda Sigma. And now that we've seen that our models work in a web UI, let's see them in action! So I opened VSCode, and installed the Continue plugin from the Marketplace. I then opened the JSON configuration, and set the URL of my Ollama instance, as well as the model for the chat and auto suggestions. I found it cool that you can specify multiple models for chat, and also set your code completion model separately. You're probably gonna want to have a lighter 7B or even 3B model for auto-suggestions, and leave beefy 34B and 70B models for chat interactions. And after going back to the code editor, and creating a playbook.yaml file... well, you can see the result for yourself. CodeBooga's suggestions were pretty good. As you can see, it correctly recognized that I'm writing a playbook file, and not a role, and starts the suggestion with "hosts: all". CodeLlama 7B, on the other hand, thought I was writing a task file for a role, but apart from that, the suggestions that it generated were pretty good as well. At the same time, the CodeBooga model, being a 34B one, was a tiny bit slower when it comes to auto-completion. In general, both models performed well enough, at least during my, granted, very limited testing. I've also tested both models out quickly with Python, and so far, all the suggestions that I've seen actually make sense.
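The Continue configuration described above looks roughly like this. The IP address is a placeholder for your own Ollama server, and the field names reflect Continue's config.json format at the time of writing, so check their current docs.

```json
{
  "models": [
    {
      "title": "CodeBooga 34B (chat)",
      "provider": "ollama",
      "model": "codebooga:34b",
      "apiBase": "http://192.168.1.50:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "CodeLlama 7B (completions)",
    "provider": "ollama",
    "model": "codellama:7b",
    "apiBase": "http://192.168.1.50:11434"
  }
}
```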
Now, do keep in mind that some models perform better in some languages, and worse in others. So I'd suggest trying it out yourself, and seeing what works for you. Apart from code suggestions, Continue also has a chat functionality, which lets you chat with the model without leaving your text editor. So now we get to the most interesting part of this experiment, at least for me, and that is power consumption. Now, the build that I'm using to run Ollama has not been built with power efficiency in mind. It's got a chiplet design AM4 Ryzen CPU, which are not known for great idle power efficiency. And it's also got a high end AMD graphics card, which are likewise known to have issues with power consumption. At idle, our AI server consumes around 63W, and while generating auto suggestions for our code, that number goes anywhere from 110W all the way up to 425W. And the average works out to around 130W, which is actually less than I expected. Interestingly enough, I didn't see much of a difference between using a lighter model like CodeLlama 7B and using something like CodeBooga 34B. In both cases, the power consumption average stayed at that 130W mark. So considering that power draw figure, and considering the upfront cost of a decent GPU with loads of VRAM, the obvious question is: can we run this on something less power hungry and less costly, and still get decent results? In order to see that, I've installed Ollama on the LattePanda Sigma, which I've already made a video about in the past. With 16 gigs of RAM and a laptop tier Intel processor, it's not really made to be an AI powerhouse, but that's the kind of specs you're more likely to see in a machine that you might have lying around. More likely than a one and a half grand gaming PC, anyway. So I configured the Ollama instance with the same docker-compose that I used for my gaming machine, with two exceptions.
I used the latest tag for Ollama instead of ROCm, and I also removed the mounts for the AMD graphics card, obviously. Then, I started the Docker Compose stack, and as you can see, even though Ollama didn't detect a compatible GPU, it still started no problem. So, I went to the Open WebUI instance in the browser, downloaded the models, and after a few minutes, we're now ready to start. I decided to start with the CodeBooga model, to really stress test the little machine. Little did I know, running a model that big is out of the question. The CodeBooga model needs 20 gigs of RAM, whereas we only have sixteen. So, CodeLlama it is. And even though that model worked just fine, the text generation was veeeeeeery slow. For comparison, here's my gaming machine generating a response to the same prompt. Out of curiosity, I decided to try the starcoder:3b model, to see if using a smaller model would speed up the process. And even though the text generation was a tiny bit faster this time around, the result was 2 pages of the worst AI hallucinations I've ever seen. Eventually, the model seemed to give up completely, outputting... whatever this is. Now, once again, Ansible is not the most popular use case for these kinds of models

Segment 4 (15:00 - 17:00)

and I imagine that both models would fare better with something like Python or Java. But nevertheless, we're definitely not off to a great start. So let's test this setup with code completions. I went over to VSCode, and changed the Ollama endpoint in the Continue settings to the LattePanda Sigma. I chose CodeLlama 7B for my completion model, went to the main.yml file, and started typing. Aaand... sigh. Guys, I really wanted this setup to be viable when it comes to AI-generated code assistance, but... the suggestions were so slow and unreliable that I personally found it easier to just write the tasks myself. Unfortunately, changing to a lighter model, in this case Starcoder 3B, didn't really help. The few initial suggestions that it gave were fine, if a bit slow, but after the first task, it kind of just seemed to give up entirely. So I guess that a GPU really does give you a major boost when it comes to working with large language models, and without a dedicated graphics card with support for CUDA or ROCm, you're probably gonna be stuck with a setup that is slow enough to be unusable in some cases. Like, for instance, code suggestions. Which is a shame, because in terms of power draw, this little PC crushes my gaming setup. The LattePanda Sigma draws as little as 4.6W at idle, and around 40 to 60 watts when working with large language models. So now we come to the biggest question of this video: who is this even for? Now, don't get me wrong. The fact that you can run a large language model that's comparable in quality to something like ChatGPT, even on the most surface level, at your own house, using free and open source software and consumer hardware: that's amazing. But at the same time, it basically needs a high end graphics card to work well. And building a thousand euro plus computer just to get some better code suggestions in your IDE is probably not the smartest idea ever.
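The RAM wall the Sigma hit earlier follows from a rough rule of thumb (an approximation, not an exact formula): a quantized model needs about its parameter count in billions, times the bits per weight, divided by 8, in gigabytes, plus overhead for context. For a roughly 5-bit quantized 34B model like CodeBooga:

```shell
# Rough memory estimate for a ~5-bit quantized 34B model.
# params (billions) * bits per weight / 8 = gigabytes for the weights alone.
params_b=34
bits=5
gb=$(( params_b * bits / 8 ))
echo "~${gb} GB for the weights alone"   # ~21 GB, well over 16 GB of RAM
```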
On the other hand, if you already have a decent gaming or workstation PC, this could probably save you a subscription to some SaaS product. Now, personally, I found the code suggestions produced by models like CodeBooga and CodeLlama pretty useful. But... maybe not useful enough to run 130W of compute every time I want to write code. But what do you guys think? Would you run this setup yourself, or are you a Neovim chad? Anyway, that's gonna be it for this video. I hope you guys enjoyed it, and as usual, I would like to thank my patrons
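As a back-of-the-envelope on that 130W figure: at an assumed 30 cents per kWh (plug in your own electricity rate), a few hours of assisted coding per day works out to pennies, so the upfront hardware cost and the idle draw matter more than the load spikes.

```shell
# Back-of-the-envelope energy cost for the ~130 W average measured above.
# The 30 cents/kWh electricity price is an assumption; adjust to your rate.
avg_w=130
hours_per_day=4
price_cents_per_kwh=30
wh_per_day=$(( avg_w * hours_per_day ))                      # 520 Wh
cents_per_day=$(( wh_per_day * price_cents_per_kwh / 1000 )) # ~15 cents (integer math)
echo "${wh_per_day} Wh/day, roughly ${cents_per_day} cents/day"
```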

More videos by the author: Wolfgang's Channel
