# 15 Gen AI Cost Optimization Tips for Interviews and Real-World Projects

## Metadata

- **Channel:** Cloud With Raj
- **YouTube:** https://www.youtube.com/watch?v=lpj9XqEyHjg
- **Date:** 11.05.2026
- **Duration:** 19:40
- **Views:** 4,056
- **Source:** https://ekstraktznaniy.ru/video/51694

## Description

🚀Join the real-world SA bootcamp (Limited spots, 8th cohort enrollment opens 05/16/2026, past students got high paying jobs) : https://www.sabootcamp.com/
📰 Keep Up with the latest Gen AI & Cloud updates with me (FREE): https://cloudwithraj.short.gy/Q6MrN3

You don't need to go bankrupt to run Gen AI!

🎬Videos you might like:
Build Agents on AWS - Step By Step with Demo: https://youtu.be/epWeri7OQi8 
Build AI Agent That Gets YOU A Job: https://youtu.be/f7axOmeHA1Q 
Build AI Agent That Gets YOU Hired - Step By Step With Demo: https://youtu.be/Z2iN_xViV-0

🔥Get my courses with max discount: https://www.cloudwithraj.com/

Connect with me: 
🤳 Instagram: https://www.instagram.com/cloudwithraj/
🏢 LinkedIN: https://www.linkedin.com/in/cloudwithraj/ 
🐦Twitter: https://twitter.com/cloudwithraj

Don't forget to subscribe to be updated about more videos like this!

#genai #agent #interview

## Transcript

### Segment 1 (00:00 - 05:00) [0:00]

GenAI is expensive. Most GenAI projects overspend by 40 to 60%. If you are looking to upskill in cloud and GenAI, you must know GenAI cost optimization techniques. Most people know only a few of the common ones. In this video, I'm going to go over 15 different GenAI cost optimization techniques, and I have broken them down into three tiers: tier one easy, tier two moderate, tier three advanced. If you can remember five to seven of these techniques and explain them to the recruiter and interviewer, you will impress them, and you can also save a lot of money in your real-world projects.

For those of you who are new to my channel, my name is Raj. I personally switched my career from mainframe to cloud. Then I made it to Big Tech, at AWS, where I worked on real production projects on cloud and GenAI. Currently, I'm building a stealth startup, and I have helped many people transition their careers to cloud and GenAI.

All right, with that being said, let's jump into the tier one quick-win tips. Tip number one, use the right model for the right task. Even though all of you want to use Opus for everything, that is overkill if you are doing something like creating a chatbot, formatting a document, or summarizing. For those tasks, Haiku is more than enough, and it is quite cheap. Sonnet is a good balance between cost and how complex an operation you want to do. Actually, for most coding work and daily tasks, Sonnet is sufficient. Only use Opus for deep architectural planning and coding that needs complex reasoning. If in your coding you are just creating a microservice that fetches or saves information from a database, you can use Sonnet. If you are creating some parallel processing with deeper, complex business logic, then use Opus.

Next one, start fresh conversations when you switch to a different topic.
Token cost grows much faster than linearly, because by default every invocation re-sends either the full context or part of the context. So, for example, you created a new chat and you are talking about one topic, and then you want to switch topics but don't want to click that new chat button (we are all lazy). You can simply type `/clear`, and that will clear the context, so the previous context is not sent again for the next topic, which has no relationship to the previous one.

This one is a big one. Disconnect unnecessary MCP servers. MCP loads all tool definitions into context on every message. This is an overhead that you do not see. If you are using a tool that shows how many tokens you are using per prompt, and you just say hi or "I want to do this", you'll see it is already using a lot of tokens. You may be wondering, "Wait, my prompt is very small," or "I didn't even start working yet. Why are all these tokens in the context?" It's because it's sending all the MCP tool information.

Instead, use skills and CLIs. Skills are only loaded when they are needed. So, you can have a bunch of different skills, and only the names and descriptions will be loaded into the context, not the full details. Only when the prompt matches what a skill does will the whole thing be loaded, and only for that skill. CLIs are the next advanced thing, so let me explain. Let's say you want to list the S3 buckets in your AWS account. You can do that with the AWS MCP tool, but then it loads everything. You can also do that with a skill, which will load some stuff. But if you think about it, all these large language models are already trained on the AWS CLI commands. So, the large language model knows that it simply needs to run `aws s3 ls` and it will get the S3 buckets. I predict that in the future, agents will be using CLIs over MCPs. CLIs plus skills are faster and cheaper than MCP, and that's what will be used.
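
To make the "start fresh conversations" and "MCP overhead" points concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model that is not from the video: every turn adds the same number of tokens to the history, the whole history is re-sent on each request, and tool definitions add a fixed number of tokens per request. The function name and all numbers are hypothetical and only illustrate the shape of the growth.

```python
def cumulative_input_tokens(turns, tokens_per_turn, fixed_overhead=0):
    """Total input tokens sent over a conversation when history is re-sent each turn.

    turns           -- number of requests in the conversation
    tokens_per_turn -- tokens each turn adds to the history (assumed constant)
    fixed_overhead  -- tokens sent on every request regardless of history,
                       e.g. tool definitions (assumed constant)
    """
    total = 0
    for turn in range(1, turns + 1):
        context = turn * tokens_per_turn  # history so far plus the new message
        total += context + fixed_overhead
    return total


one_long_chat = cumulative_input_tokens(20, 500)        # 105000
two_fresh_chats = 2 * cumulative_input_tokens(10, 500)  # 55000
with_tool_defs = cumulative_input_tokens(20, 500, fixed_overhead=3000)  # 165000

print(one_long_chat, two_fresh_chats, with_tool_defs)
```

Under these assumptions, splitting one 20-turn chat into two 10-turn chats roughly halves the input tokens, and 3,000 tokens of always-loaded tool definitions cost more than half as much again as the conversation itself. Real tools may truncate or summarize history, so measure your own sessions.
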
Real quick, guys, if you want to get an AWS Solutions Architect job without coding or learning every AWS service, join the waitlist for the next cohort of the SA Bootcamp, where we cover technical, behavioral, executive communication, hands-on, LinkedIn and resume improvement, mock interviews, and more. Find details and the waitlist at sabootcamp.com. All right, back to the video.

Next one, maintain a CLAUDE.md file. CLAUDE.md is a file that gets injected into the system prompt along with your prompt. So, you need to be careful about what to include, but include the tech stack, coding conventions, and project structure. For example, let's say you are coding for a project and you want all the code to be in Python. So, put that there.

### Segment 2 (05:00 - 10:00) [5:00]

"Hey, use Python version 3.11 in all the code." That way, in each coding session you don't have to repeat yourself, because sometimes the large language model creates code in Python 3.10 and sometimes in Node.js. Because it's in CLAUDE.md, you save the tokens spent going back and forth on setting up these conventions. Also save the architectural decisions. For example, if I need to fetch something from the orders table, always use a microservice. If the orders application needs to send something to the nightly batch application, I should be using event-driven architecture, and so on. So, this markdown document is like an index of pointers to detailed docs. You can have detailed design discussions or a detailed standard on how to create a microservice, and you just put that link in this markdown file, so that you change it in one place and the markdown file picks it up automatically. You don't have to maintain it in two different places.

Why does it save tokens? It's quite natural, right? This file acts as an index, and Claude knows where to look without searching. CLAUDE.md is for Claude, but if you are using GPT, it's AGENTS.md, same concept. It also prevents Claude or GPT from exploring wrong paths. You don't have to stop it and then re-prompt it and all that stuff. All the main architectural and coding best practices are there in the CLAUDE.md file, and it saves back-and-forth conversation tokens.

Tip number five, give all related tasks in one go. Here is the costly approach. Let's say in a session, first you say, "Hey, summarize this file." Then you go back and forth, then say, "Okay, now extract the issues." Then you go back and forth, then you say, "Suggest a fix." This is costly because each message re-reads all prior context. Instead, give the context and then say, "I want you to summarize this file, extract the issues, and suggest fixes." One reason is, of course, that there is less back and forth and no repetitive reading of the same information by the large language model. Another, bigger thing is that the better the final picture you give the large language model, the better the result. If you tell it, "This is your final goal: one, two, three," then it can also plan better.

Okay, next tip, monitor your costs. There are multiple ways to do that. You can run the `/context` command, and it will show you what's eating your tokens, like history, MCP, files, etc. You can run `/cost`, and it will show you the token usage and spend for the current session. You can also create a custom status line or use a third-party tool like Cline. I use Cline. Cline shows you how many tokens it is using per prompt, the cost, all that stuff. Or if you are using Anthropic or OpenAI, they both have dashboards where you can track the usage and pace yourself between quota resets. For those of you who are in the AWS ecosystem with Bedrock, you can set up CloudWatch alarms and AWS Budgets with SNS alerts.

All right, now let's go to tier two, intermediate cost optimizations. Use memory. Without memory, the LLM goes through similar context again to derive the same insights. You have to re-describe your preferences and decisions every time. So think about it: you are creating a travel agent, and maybe you prefer to stay indoors when it is cold out, right? So this travel agent should look up indoor activities. But imagine you are planning these activities, you have to leave, and you come back afterwards. If you don't have memory, you have to re-describe all those preferences again. Whereas if you use memory, the large language model will remember your preferences, the summary, and some of the other stuff, so you don't have to re-describe all of it and burn precious tokens. Use CLAUDE.md plus memory to make it even more effective. Pro tip, you can manage memory yourself.
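
As a minimal sketch of what managing memory yourself could look like, the example below stores preferences in a local JSON file and prepends them to each new request. This is an assumption made for illustration, not the approach shown in the video: the class name, function name, and file layout are hypothetical, and a production agent would more likely use a database or a vector store.

```python
import json
from pathlib import Path


class PreferenceMemory:
    """Tiny file-backed memory: a dict of facts that survives across sessions."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier facts if the file already exists, otherwise start empty.
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key, value):
        """Store or update one fact and persist the whole dict to disk."""
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def as_context(self):
        """Render the facts as a short bullet list for the prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in sorted(self.facts.items()))


def build_prompt(memory, user_message):
    """Prepend remembered preferences so the user does not have to repeat them."""
    context = memory.as_context()
    if not context:
        return user_message
    return f"Known user preferences:\n{context}\n\nRequest: {user_message}"
```

In a later session, creating `PreferenceMemory` with the same path reloads the stored facts, so a call like `build_prompt(memory, "Plan my Saturday")` already carries the "prefers indoor activities when it is cold" preference as a few compact lines instead of a replayed conversation.
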
Memory at the end of the day is basically some storage or a vector database. So you can run your own memory, but that's a lot of overhead. If you are in the AWS ecosystem, Amazon Bedrock AgentCore, which can run agents for you, can also manage the memory for you. So I highly recommend using that.

Next one, be mindful of agent costs. I'm seeing this with my students as well. They're like, "Raj, I want to use sub-agents, agent teams." They sound cool, but they're very expensive. If a single agent uses 1x tokens, sub-agents use 3x to 5x, because one sub-agent doesn't know what the other sub-agents are running. They each have their own separate context, and because they don't know what the others are doing, sometimes there is repetition in the context. An agent team is super expensive, because the agents are interacting with each other and running

### Segment 3 (10:00 - 15:00) [10:00]

different things over a message bus with full context. So, think about the task you are doing: does it really need all these fancy multiple agents? If not, just use a single agent, not a problem.

Okay, next one. This one is catching up with people. Vector databases can be expensive. By default, AWS suggests OpenSearch Serverless as the vector database. The good news is that the latency is low, but it is very expensive. Even though the name says serverless, it charges you for the compute capacity it reserves for you even when you are not using it. So, it's best for production RAG and real-time search, but you don't need it in test or even in some production use cases. A good middle ground is Aurora PostgreSQL with pgvector, where the latency is still low, and if you are already using a SQL database like RDS, it's a good in-between: you can reuse SQL as well as vector search. The new enhancement that lets you use an S3 bucket as a vector database is the most cost-effective option. The latency is a little bit higher, but it's way cheaper than OpenSearch or Aurora PostgreSQL. Sometimes people get confused here, and I cannot give exact numbers because the latency changes based on how much information you have in the vector database. So, test for your application. Even though you think this latency is higher, it might be within your SLA; it could be milliseconds. So, if you are doing batch processing, cost-sensitive applications, or even production RAG, test out this S3 vector database and see if you can make it work.

Next tip, clean up your RAG documents. Again, I'm seeing this in production workloads. Some companies run their RAG ingestion workflow every night. When it runs every night, every week, every month, and then for a year or more, because RAG has been out for a couple of years now, you have to be careful.
You have to go back and remove the old documents, because every document means stored embeddings, and old documents you never query still cost you money every month. And not only that. First, your RAG process needs to go through an increased number of embeddings, so it takes more time. Second, sometimes it retrieves older, irrelevant chunks, which come in as context, and the output is not as good. Then you have to think, "Oh, I need to re-rank this, I need to do semantic searching," and so on. So, if you just clean up the older documents that you don't need, it saves you headaches, latency, and money.

All right. Now, let's go to the tier three advanced strategies. Add a semantic caching layer. The semantic cache sits in front of the agent. An example is Redis with LangChain. Now, the word semantic has a special meaning here. This cache can extract the underlying intent of what you are trying to do. For example, the first time a user asks, "How do I reset my password?" there's nothing in the cache, so the request goes to the agent, goes to the LLM, gives the user the answer, and also saves it in the semantic cache. Next time, another user asks, "I forgot my password, what do I do?" Those are two different queries, but the underlying intent is the same. A traditional cache cannot work on the underlying intent, but a semantic cache can figure out that this is a similar query to the one before. So, instead of going to the agent and the LLM, it just returns the answer from the semantic cache. You save a lot of money because you do not need to hit the agent and the underlying large language model.

All right. The next method is called the Karpathy method, or the large language model wiki. This is a new one. I'll put a screenshot on the screen. In this method, you maintain a markdown file of separate insights, which is a living knowledge base.
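
The semantic caching flow described above (check the cache by similarity, call the model only on a miss, then store the answer) can be sketched as follows. This is a toy illustration under loud assumptions: `toy_embed` is a stand-in word-count vector with a small stopword list, where a real system would call an embedding model; the 0.4 threshold is arbitrary; and `call_llm` is a hypothetical callable, not a real API. It is not the Redis or LangChain implementation.

```python
import math
import re
from collections import Counter

# Toy stand-in for an embedding model: word counts without common filler words.
STOPWORDS = {"how", "do", "i", "my", "what", "the", "a", "to"}


def toy_embed(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) pairs

    def lookup(self, query):
        """Return the stored answer of the most similar query, or None on a miss."""
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, query, answer):
        self.entries.append((self.embed(query), answer))


def answer(query, cache, call_llm):
    """Serve from the cache when a similar query was seen; otherwise pay for the model."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    result = call_llm(query)
    cache.store(query, result)
    return result
```

With this toy embedding, "How do I reset my password?" and "I forgot my password, what do I do?" share the word "password" and land above the threshold, so the second question is served from the cache. The threshold is the main design choice: too low and users get answers to questions they did not ask, too high and the cache rarely hits.
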
The concept is this: you keep a running markdown file of lessons learned, patterns, and decisions. Each insight is a one-liner. No need for long descriptions. Then this gets fed to the LLM as structured context, instead of it rediscovering the same things. So, think of this as the CLAUDE.md file on steroids. Let's say you are working in a company and you deal with different customers. This LLM wiki could say: customer A, the big challenge is this, the way it can be resolved is at this link, the to-do is this. Customer B, this and this. Or you can think of different projects: the insights gained, the challenges, all that stuff in a short, structured manner. So, when you start working on that project in the future, the large language model has all that information and insight in a structured way, and it can go from there. You don't need to go back and forth. It doesn't need to search the whole knowledge base or your whole folder, all that stuff. How does it save tokens? So

### Segment 4 (15:00 - 19:00) [15:00]

no re-explanation of past failures, no rediscovery of workarounds, and compact structured context versus verbose conversation replay. It works with an "applied learnings" section in CLAUDE.md.

Next advanced tip, use model distillation. Train a smaller model to mimic a larger one: similar quality at a fraction of the cost. Let's take an example. Let's say you have Claude Opus and you want to use it to categorize production support tickets. If you think about it, that's not really that complex. The way you could do it is to give Claude Opus maybe 10,000 tickets and ask whether each one should be low, medium, or high severity. You gather the output and use it to train a smaller, cheaper model such as Haiku. So, you ask about these 10,000 tickets, it gives the output, you train the student model, and then for future tickets, ticket 10,001 onward, you ask Haiku, which is much faster and much cheaper, and it will be able to predict the severity. The results, based on the data, are 90%-plus quality but 10 to 50 times cheaper, and much faster. For those of you who are in the AWS ecosystem, Amazon Bedrock supports model distillation. You can use it to create task-specific models from larger foundation models.

Next, tip number 14, use smaller models. I sincerely believe this is the future, because every model provider is increasing the cost of its foundation models. Use a smaller model fine-tuned for a specific task. Distillation is a good way to start. Now, think about a student model that is so small it can even run on your smartphone or your laptop, right? And you can fine-tune it for your particular purpose. The reason large language models are so big is that they can do a lot of things. But in reality, you don't need the model to do everything. You will have a use case for your production application, and that's what you need it to do. You can run the model locally.
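
The first half of the distillation example above (have the large teacher model label tickets, then collect prompt and label pairs as training data for the student) might be sketched like this. Everything here is illustrative: `teacher_label` is a keyword-based stand-in for a real call to the large model, and the `prompt`/`completion` record layout is a hypothetical format, since each fine-tuning or distillation service defines its own.

```python
import json


def teacher_label(ticket):
    """Stand-in for asking the large teacher model for a severity label (hypothetical)."""
    text = ticket.lower()
    if "outage" in text or "down" in text:
        return "high"
    if "slow" in text or "error" in text:
        return "medium"
    return "low"


def build_distillation_set(tickets, label_fn):
    """Pair each ticket with the teacher's label to form student training records."""
    records = []
    for ticket in tickets:
        records.append({
            "prompt": f"Classify severity (low/medium/high): {ticket}",
            "completion": label_fn(ticket),
        })
    return records


def to_jsonl(records):
    """Serialize records one JSON object per line, a common training-file layout."""
    return "\n".join(json.dumps(record) for record in records)


tickets = [
    "Checkout service is down",
    "Dashboard is slow",
    "Typo on the help page",
]
print(to_jsonl(build_distillation_set(tickets, teacher_label)))
```

The expensive model is paid for once, while building the dataset. After the student is trained on these records, day-to-day classification goes to the cheaper model, and it is worth holding back some labeled tickets to check that the student's quality stays acceptable.
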
Small language models are cheaper and faster. Some examples are Google Gemma 4, Phi-3.5 mini, etc. Now, a pro tip: if you can do this as a proof of concept and showcase it on LinkedIn and GitHub, it looks really good in your body of work.

Next tip. At the end of the day, a GenAI application running on the cloud is just another application. So, you should apply all the cloud best practices that you already know. Apply for an enterprise discount: if you are using GenAI a lot, this gives you a generous discount. If you are running on Bedrock, use reserved capacity instead of on-demand. If you are running the model yourself, use spot or on-demand EC2 instances for inference. Spot is way cheaper than regular on-demand instances. If you are hosting your own model for inference, evaluate whether spot instances can do the task; spot instances can give you up to 90% savings. Right-size the instances: don't run G5 for a task that works on M5. One more thing: for regular application workloads, keeping capacity running even when there is no traffic is fine, but for GenAI workloads, you should scale the inference endpoints to zero when idle. Use cost allocation tags, like tagging by team, project, and environment, for per-agent cost visibility.

One bonus tip for those of you who are in the AWS ecosystem: Bedrock gives you a lot of techniques out of the box. You can use batch inference for non-real-time tasks; this is 50% cheaper than on-demand. You can do prompt caching, which can reduce cost by up to 90%. This is different from semantic caching, so study it. Use guardrails, because Bedrock Guardrails block malicious and irrelevant inputs before they hit the model, which saves wasted inference and cost. Use Bedrock model evaluation. I talked about using the right model for the right job; Bedrock has model evaluation out of the box, so use it to find the cheapest model that meets your quality bar.

All right. So, your homework: watch this video and try to remember five to seven of these tips.
This question is coming up in almost all interviews, so you should be able to explain them to the recruiter and the interviewer. If you like this video and you want to know how AgentCore manages memory, or how to create an agent with memory, I have a separate video with a step-by-step demo. So, watch that next. I'll see you all in the next one. Bye.
