I finally found a solution for my token costs, one that won't last forever...

8:14

I finally found a solution for my token costs, one that won't last forever...

Dreams of Code 14.04.2026 17 767 просмотров 679 лайков

Machine-readable: Markdown · JSON API · Site index

Смотреть на YouTube

Поделиться Telegram VK Бот

Транскрипт Скачать .md

Анализ с AI

Описание видео

This video is NOT sponsored (btw). Building an Agentic Video Editor has so far been really enjoyable, both from a technical perspective but also from a product one. However, there's been one thing that I've struggled with recently: Token Cost - Basically paying for tokens every time I want to test or develop my agentic features. Fortunately, I found a solution, one that's way more cost effective, but one that won't last forever. Links: - Fireworks: https://fireworks.ai - Kiru: https://kiru.app - Ollama: https://ollama.com Watch my course on building cli applications in Go: https://dreamsofcode.io/courses/cli-apps-go/learn 👈 My Gear: - Camera: https://amzn.to/3E3ORuX - Microphone: https://amzn.to/40wHBPP - Audio Interface: https://amzn.to/4jwbd8o - Headphones: https://amzn.to/4gasmla - Keyboard: ZSA Voyager Join this channel to get access to perks: https://www.youtube.com/channel/UCWQaM7SpSECp9FELz-cHzuQ/join Join Discord: https://discord.com/invite/eMjRTvscyt Join Twitter: https://twitter.com/dreamsofcode_io

Оглавление (2 сегментов)

Segment 1 (00:00 - 05:00)

If you follow along with my channel, then you'll know it's been a little while since I produced a video. This is because I've been working on another project, one that's been incredibly enjoyable from a software development point of view, but has perhaps caused me to do just a little bit too much hyper focusing. So, in order to get into a good work-life balance again, I've been pushing myself to get back into making videos, sharing not only what I've been up to, but also something interesting that I recently found, which has also managed to solve one of the biggest issues I've encountered with what it is I've been working on. So, what have I been up to? Well, other than attending a couple of conferences, I've also been heavily working on my next-gen video editing application, Kiru. For those who don't know, this is a desktop application I built entirely using Rust, which is meant to help me reduce the amount of time it takes for me to create videos for YouTube. So far, it's doing just that for both myself and my paying customers. Yes, I actually have users. I've got a video planned to talk about why I decided to build my own video editor, as well as do a deep dive into some of the technical decisions that I've made. For this video, however, I instead want to talk about an issue that I have been encountering quite a lot recently whilst developing some of the newer LLM-based features, and also share an awesome solution that I found to these problems, one that there's a good chance won't be around forever. So, what is this problem I was encountering? Well, it's to do with the cost of inference, specifically when it comes to developing and testing features that integrate with LLMs. To understand what I mean, let me first explain how Kiru actually works. One of the main features of Kiru is that I can drop in my raw unedited A-roll footage, and it'll automatically remove the mistakes, bad takes, and silences within a couple of minutes. Now, one might be excused for thinking that this is achieved through the use of an LLM, but up until recently, this was done entirely using a deterministic algorithm of my own creation. There's a few reasons as to why I chose to use this approach. However, the most major one was to do with development and testing, specifically related to token output and token cost. Recently, however, I've started to hit the limit of what a deterministic algorithm can do, and for some situations, an LLM is needed in order to determine whether or not a phrase is actually a retake or just something said similar. In addition to this, I've also been adding in some other LLM-based features, such as agentic mode, where you can chat with an AI agent to make edits to your actual video, or my personal favorite, auto place, which is where I can right-click an item inside of my media pool, and the LLM agent will automatically place it in the timeline where it best fits. As you can imagine, both of these features, when implemented correctly, can save a lot of time when it comes to editing video. The catch is, however, that in order to get these implemented correctly, it's required quite a lot of trial and error, and subsequently spending a lot on LLM tokens. Now, whilst there's always an associated cost when it comes to software development, LLM tokens had another dimension that I personally don't enjoy. For starters, whenever I work on a new problem, I like to start by building the least efficient implementation to get things working, and then iterate on that implementation to develop an optimized solution. In typical software development, this tends to work pretty well, but when it comes to building with AI and AI agents, and having to pay per token, then this can end up being quite expensive. On my worst days, I was paying just around $20 just for testing costs alone. Now, whilst this isn't a huge amount of money, it's far more than I really want to be spending when it comes to development, especially as I'm bootstrapped and not a VC-funded company with a ridiculous token budget. In addition to this, the fact I was constantly concerned about my token usage was also starting to affect my development speed as well, as I was having to make sure that I built in safeguards to prevent any runaway context. Whilst this is part of the challenge when it comes to building an AI agent, as I mentioned before, this meant I had to fully constrain myself before fully understanding the problem domain, which made implementing a solution just much more challenging. Because of this, I decided to look for a more cost-effective solution, and ended up attempting to use local LLM models, such as GLM 4. 7 Flash and Qwen 3. 5, both of which run pretty well on both the Framework Desktop and the new M5 Max MacBook Pro. Whilst using local models for development and testing did work well from both a cost-savings and correctness point of view, the biggest issue that I encountered was when it came to speed, not necessarily due to a lack of tokens per second, but instead when it comes to concurrency. This is because the prompts that I have in my system are embarrassingly parallel, which allows me to perform multiple LLM calls at the same time, something I couldn't really utilize when running against local models. This ended up making the iteration loop for development incredibly slow, and also made it impossible for me to gauge real-world performance, which is something I placed as a priority when

Segment 2 (05:00 - 08:00)

building this product. In fact, I managed to solve a number of hard engineering problems in pursuit of this goal, which is something I'll be sharing more about soon. In any case, all of this meant that whilst using local LLMs was incredibly cost-effective, it unfortunately just wasn't viable long-term, and so I needed to find another solution. Fortunately, I ended up discovering one that not only solved my immediate problems, but I've also been able to use it for a number of LLM-based applications. This solution is the Fire Pass plan by fireworks. ai, which is yet another LLM subscription model, but one that's slightly different from the others, and probably won't be around forever. By the way, this video is not sponsored in any way. This is just something I discovered on my own trying to solve one of my own problems, and is something I wanted to do a video on because it's been incredibly beneficial to my own workflow. So, what is Fire Pass, and what makes it so good? Well, for starters, the plan allows you to use the Kimi K2. 5 Turbo model from fireworks as much as you want. Yes, that means completely unlimited tokens, basically an all-you-can-eat token buffet. Of course, these unlimited tokens do come with a bit of a catch, in that the Fire Pass plan is only able to be used for heavy personal usage only. This means you can't use the plan to power a production system or serve other users, but you can use it when it comes to development and testing of local projects, which made it perfect for testing out my AI agent whilst I was developing. Not only this, but the personal usage definition also covers quite a large number of different use cases, such as being able to use it with a coding harness, such as Claude Code, CodeX, or Open Code, or using it to power your own AI agents, such as Open Claw, or if you happen to be rolling your own like I've been doing recently. Best of all, however, is that this plan only comes in at $7 a week, which, yes, is just $1 a day, far less than I was paying when it was pay-per-token. As for how to actually use the Fire Pass plan, well, this is incredibly simple. All you have to do is generate an API token in the fireworks. ai console, and then you can use any agent or harness that supports either OpenAI or Anthropic API schemas, setting the base URL to fireworks, and then the model name as follows. In my case, I actually use this with a new feature provided by Tailscale called Tailscale Aperture, which allows you to manage LLM API tokens for any devices in your tailnet in a central location, kind of like using an LLM proxy. I've actually got a video that I'm working on currently that shows off some of the cool things you can do with this new feature. Going back to Fire Pass, however, one thing to note is that the plan itself is currently defined as early access, which means it probably won't remain in this form forever. Personally, I fully expect the unlimited tokens to one day not be a thing, at least not in the current price point. Until that time, however, I'm just going to enjoy it and apply it to a number of personal use cases, such as testing my AI integrations, but also for powering a couple of custom AI agents that I'm currently building. And whilst the plan might not last forever, right now it's probably one of the best deals when it comes to integrating with an LLM. In any case, that's all for me for this video. I want to give a big thank you to you for watching, and I'll see you on the next one.

Другие видео автора — Dreams of Code

Ctrl+V

Экстракт Знаний в Telegram

Экстракты и дистилляты из лучших YouTube-каналов — сразу после публикации.

Подписаться

Лучшие методички за неделю — каждый понедельник