Claude Code Skills Just Got Their Biggest Update Yet


Ray Amjad · 04.03.2026 · 3,513 views · 186 likes


Video description
Level up with my Claude Code Masterclass 👉 https://www.masterclaudecode.com/?utm_source=youtube&utm_campaign=qXWz-V_XMOc
🎉 Use BIRTHDAY at checkout for a Claude Code Birthday discount.
Learn the AI I'm learning with my newsletter 👉 https://newsletter.rayamjad.com/
To avoid bias, I've never accepted a sponsor; my videos are made possible by my own products...

—— MY CLASSES ——
🚀 Claude Code Masterclass: https://www.masterclaudecode.com/?utm_source=youtube&utm_campaign=qXWz-V_XMOc

—— MY APPS ——
🎙️ HyperWhisper, write 5x faster with your voice on MacOS & Windows: https://www.hyperwhisper.com/?utm_source=youtube&utm_campaign=qXWz-V_XMOc - Use coupon code YTSAVE for 20% off
📲 Tensor AI: Never Miss the AI News - on iOS: https://apps.apple.com/us/app/ai-news-tensor-ai/id6746403746 - on Android: https://play.google.com/store/apps/details?id=app.tensorai.tensorai - 100% FREE
📹 VidTempla, Manage YouTube Descriptions at Scale: http://vidtempla.com/?utm_source=youtube&utm_campaign=qXWz-V_XMOc
💬 AgentStack, AI agents for customer support and sales: https://www.agentstack.build/?utm_source=youtube&utm_campaign=qXWz-V_XMOc - Request private beta by emailing r@rayamjad.com

—————
CONNECT WITH ME
🐦 X: https://x.com/@theramjad
👥 LinkedIn: https://www.linkedin.com/in/rayamjad/
📸 Instagram: https://www.instagram.com/theramjad/
🌍 My website/blog: https://www.rayamjad.com/

—————
Links:
- https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills
- https://github.com/anthropics/skills
- https://github.com/coreyhaines31/marketingskills

Timestamps:
00:00 - Intro
00:15 - Status Quo
01:14 - Anthropic's Solution
01:46 - Categories of Skills
03:49 - Anthropic's PDF Skill
04:03 - Comparison with Older Skill Creator
05:08 - Example of Using It
07:33 - Types of Evals
08:18 - Benchmarks
09:20 - Improving Skill Triggers
11:06 - Conclusion

Contents (11 segments)

Intro

Alright, so Anthropic just made it much easier for you to make better skills for Claude Code and for Claude Cowork. And that's exactly what we will be going over in this video. But before getting started, if you are interested, there is a sale going on right now in my Claude Code Masterclass to celebrate the 1-year birthday of Claude Code. It is the most comprehensive

Status Quo

class on Claude Code that you will find online, and many people from hundreds of companies have taken it and have gone on to be the best Claude Code users at their companies. Okay, so right now most people are developing Claude skills based exclusively on vibes. What they're doing is they go through the process once with Claude Code, and then they say, hey, can you turn this into a skill? They may give it additional resources, such as a blog post or an internal document or something else, to help it make a better skill. And then they will try the skill a few times to make sure that it works. They'll be like, hey, this looks good, and then ship it to the rest of the team to use, or just put it online. And many people have made some really great skills with this approach. For example, this GitHub repo of marketing skills. But there are a couple of problems here. Firstly, whenever we have a brand new model update, it may be the case that your skill is actually no longer helping Claude Code, because a lot of the ideas and functionality you encoded inside of your skill have now been encoded into the model. Or the model can do a better job than your skill can, so by triggering your skill, it's actually holding itself back from reaching its true potential. And also, right now you don't really know if making a change to a skill will lead to any better output. So what Anthropic did is they made a whole bunch

Anthropic's Solution

of improvements to their Skill Creator skill to make it easier for you to write skills, see if they're actually making a difference, run evals to make sure they're being triggered reliably, and a whole bunch of other things as well. We'll be going over an example later on in the video, but basically you would use this brand new Skill Creator skill to help you make one. Then you may want to run your own evals to make sure it's being triggered reliably, in the way that you want it to be triggered. And you may want to do an A/B test to make sure that the skill you just developed is actually making a difference. So Anthropic says that skills

Categories of Skills

generally fall into two different categories. The first of which is capability uplift. Essentially, right now the model may not be smart enough in a certain domain. It may not know how to handle Swift concurrency properly, or it may be really bad at filling out PDFs or making PowerPoints. This kind of skill basically gives the model missing information, or provides techniques or patterns that it can use to achieve whatever goal you have. As an example, in the Anthropic official repository, they have a bunch of skills such as handling Word documents, PDFs, and PowerPoints. And the reason Anthropic made a PDF skill is because right now the models aren't really that great at handling PDF-related tasks. It may be the case that Opus 4.5 or Opus 5 becomes much better at handling PDFs and you no longer need to have the skill. So usually capability uplift skills, which basically fill in missing information or teach the model techniques, have their own retirement date. And the Skill Creator skill can help you determine whether you should get rid of that skill, because the base model's capability has caught up to the level of the skill. The next category of skills basically encodes workflows or preferences that you have. That could be because of compliance-related reasons, or because your system is just designed in a certain way. A quick example of that kind of skill would be the Windows Release skill for my application HyperWhisper, my voice-to-text application. When I want to do a new Windows release, like after making an update, I just trigger the skill and it goes through the entire workflow that I defined before. But you still want to make sure that when you're developing your skill, it's being triggered reliably and doing the right thing that you would expect, especially if the skill is pretty complicated.
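Either way, a skill is just a directory containing a SKILL.md file whose frontmatter holds the name and the description Claude uses to decide when to load it. A minimal sketch of the shape of such a file; the skill name, description, and steps here are illustrative, not the actual contents of any Anthropic or HyperWhisper skill:

```markdown
---
name: windows-release
description: Release a new Windows build of the app. Use when the user asks to
  ship, package, or publish a Windows release.
---

# Windows Release

1. Bump the version number and update the changelog.
2. Build and sign the installer.
3. Upload the artifact and publish the release notes.
```

The `description` field is what Claude matches user requests against, which is why the trigger-optimization step covered later in the video focuses on rewriting that one field.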
So capability uplift can include filling out PDF forms, using OCR to help you know which part of the form should be filled out, and also making complicated documents, like PowerPoint and Word documents. Skills that encode preferences would be something like an NDA review checklist, a skill that compiles data from all your different MCP servers, like PostHog and Jira, into a weekly report, or a skill that has a very specific flow for code reviews. So for example, Anthropic found that their PDF skill struggled with forms that did

Anthropic's PDF Skill

not have fillable components, and it always placed things in the wrong area. So they used their brand new method to isolate the failure, make improvements to the skill, and then it finally started working consistently. All right, now let's go over a quick example of how you can use this. Firstly,

Comparison with Older Skill Creator

you want to go to /plugins and make sure that you have the Skill Creator plugin installed. So searching for it over here, and then we can install it at the project level in our case. Then, after restarting Claude Code, if I do /Skill Creator, I can see that I actually have two. One of these is an older one that I installed a couple of months ago, and that exists at my user level. So I'm going to delete this from the user-level Claude.md file, and that has now disappeared. Anyways, we can ask the Skill Creator skill, hey, can you tell me what you can do? And just to quickly compare it to what the old one says it can do: they can both make skills from scratch, but the new one can create test cases to see how the skill performs. Improving existing skills also gets some new benefits: it can run test prompts and compare against a baseline, it can identify what's not working and revise instructions, and it can run benchmarks for us. And finally, it can run an optimization feedback loop for us, where it will test different skill descriptions to see which one triggers reliably against realistic prompts. Okay, now let's go through the entire process so I can explain how it works. All right, now let's

Example of Using It

use this Skill Creator skill and say, "Make me an SEO audit skill." Press enter. Ideally we would want to be more descriptive, or give it reference documentation it can use to make a better skill, but I'll start off with a pretty simple description to rely on the underlying model's capabilities. When asking questions, it now says: should we set up test cases to verify the skill works well? And I'm going to say, yes, run evals as well. Okay, so now it made us a skill, and it will make us some test cases as well, with realistic prompts. So these are the evals that it came up with for us. And now it's going to start testing the skill to make sure that it actually makes a difference. It's launching 6 runs in parallel: 3 with the skill and 3 without. Then it will grade the results against expectations that it has in this expectation list. We can see these agents running by doing /tasks. All of these are running, half without the skill and half with the skill. Go into any of these and we can kind of see what's happening behind the scenes. Essentially, for each eval it spawned 2 subagents, one with the skill and one without the skill. Then it goes back into the main session, which has a comparator. The comparator just compares both of the outputs, but it doesn't know which one used the skill and which one did not. So this applies to making a skill, because you can actually make sure it's making some kind of difference in your codebase. And also, when we get a model upgrade, you can reevaluate your most critical skills that you use every single day with the Skill Creator and be like, hey, can you run an A/B test to make sure that this skill is still performing well? It may be the case that right now Opus 4.6 isn't really that good at SEO auditing, but Opus 5 will be.
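The blind with/without comparison described above can be sketched roughly like this. This is a toy model, not the Skill Creator's actual implementation: `run_agent` and the grader are stubs standing in for real subagent runs. The key idea it shows is only that the grader never learns which output used the skill:

```python
import random

def run_agent(prompt, skill=None):
    # Stub standing in for a real Claude Code subagent run: with the
    # skill loaded we pretend the output is more thorough.
    base = f"audit of {prompt}"
    return base + " with structured findings" if skill else base

def blind_compare(with_out, without_out, grader):
    # Shuffle the pair so position carries no information, let the
    # grader judge the anonymous texts, then map the verdict back.
    labeled = [("with_skill", with_out), ("without_skill", without_out)]
    random.shuffle(labeled)
    winner_idx = grader(labeled[0][1], labeled[1][1])
    return labeled[winner_idx][0]

def ab_test(prompts, skill, grader, runs_per_prompt=3):
    # 3 runs with the skill and 3 without, per prompt, as in the video.
    wins = {"with_skill": 0, "without_skill": 0}
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            winner = blind_compare(
                run_agent(prompt, skill=skill), run_agent(prompt), grader
            )
            wins[winner] += 1
    return wins

# Toy grader that prefers the longer, more detailed output.
grader = lambda a, b: 0 if len(a) >= len(b) else 1
result = ab_test(["example.com"], skill="seo-audit", grader=grader)
print(result)  # the with-skill output wins every run here by construction
```

In the real workflow the grader is itself an agent judging report quality, but the blinding trick is the same: it is only told "output A" and "output B".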
So when Opus 5 comes around, you can then run this skill comparison A/B test on any pre-existing skills that you have made, and determine whether you should delete the skill, keep it, or change it. In the case of Anthropic, they benchmarked their PDF skill with and without the skill loaded, and they found that having the skill loaded led to a better pass rate. So it seems that all 6 runs have been completed, and it's launching 6 grader agents to grade them in parallel. And if we look inside our project, it made an SEO audit workspace. This is the first iteration that it came up with with the skill, the report that it came up with, the grading according to the evals that we had defined, and then also some timing information. It repeated this 6 times. But yeah, let's wait until the grading has been completed. So essentially we

Types of Evals

have 2 types of evals that can happen. One of them is capability evals, which judge which output is better: the SEO audit is better, or the PDF fields have been filled in correctly. And then we have procedural evals. So you could have an insurance claim triage skill that you made for your organization. You fed in all the internal documentation, and then you gave it 20 or 30 examples, with the expected result, that it can run evals against. For example, you could say that if the insurance claim value is greater than $10,000, it requires a police report, and if it's an injury claim, it requires medical records as well. Now with this new framework, you can basically give the Skill Creator skill your insurance claim triage skill, give it a bunch of examples, and then have it automatically grade and make improvements to the skill.
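Those triage rules lend themselves to exactly this kind of deterministic grading. A rough sketch, where the rules, field names, and examples are all hypothetical, shows how labeled examples can score a skill's output automatically:

```python
def triage_claim(claim):
    # Toy stand-in for what the triage skill should output: the list of
    # required documents under the (hypothetical) org rules above.
    required = []
    if claim["value"] > 10_000:
        required.append("police_report")
    if claim["type"] == "injury":
        required.append("medical_records")
    return required

def procedural_pass_rate(skill_fn, examples):
    # Grade the skill's output against each labeled example exactly.
    passed = sum(1 for ex in examples if skill_fn(ex["claim"]) == ex["expected"])
    return passed / len(examples)

examples = [
    {"claim": {"value": 15_000, "type": "property"},
     "expected": ["police_report"]},
    {"claim": {"value": 4_000, "type": "injury"},
     "expected": ["medical_records"]},
    {"claim": {"value": 12_500, "type": "injury"},
     "expected": ["police_report", "medical_records"]},
]
rate = procedural_pass_rate(triage_claim, examples)
print(rate)  # 1.0: every example matches the rules
```

The Skill Creator's procedural evals use an agent as the grader rather than exact matching, but the shape, labeled examples plus an automatic pass rate, is the same.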

Benchmarks

Okay, so now it's given us the benchmark results. It says that with the skill enabled, the success rate is 13.5% higher, the average time to complete the task is 22% lower, and it uses slightly more tokens with the skill enabled. It says there are some HTML reports that I can see in the viewer, but it hadn't opened up my browser for me. Now it's opened up the report for me right over here, so I can see the outputs. These are basically long HTML files for the SEO reports, and then the grade that it's given each output. I can leave my own feedback here, go over to the next output, and review that as well. Once I review all of them, I can press submit all reviews. If I go to benchmarks, I can see the benchmarks that we saw before, which assertions were passed with the skill and without the skill, and then some final analysis notes as well. So if I were to press submit all reviews, it would download a feedback.json file. And then I would just drag and drop this file back into Claude and say, okay, make these

Improving Skill Triggers

improvements. Now, when you have skills, your skill description determines whether or not the skill is going to be triggered. There may be some cases where you don't find your skill being triggered reliably, in which case you can tell Claude Code, with the Skill Creator skill, to improve the trigger. For example, I have this skill over here, and I can open up a new tab, go to Claude, go to Skill Creator, and say: can you optimize the description of my sync model skill to make sure it triggers more reliably? It now comes up with some example prompts of what a user would say where the skill should be triggered, and then some prompts where it should not be triggered. All right, so now it gave us a brand new review page, so we can review these prompts, decide whether or not the skill should trigger, and add our own queries as well. Once we're done, we can export this and drag it back into Claude Code. So pressing enter. Now it's going to go through its optimization loop, where we have 20 different queries. It will train on 60% and keep 40% for testing, so it's kind of like machine learning generally, where you have a training set and a test set. Essentially, the way that it works is: our queries are split up into a training set and a testing set. Claude then fires all the queries in the training set and checks whether the skill was actually triggered. It doesn't actually run through the entire skill, it just checks whether it was triggered. This will repeat up to 5 times until it produces a better description. And you can see right now it says the description is too generic for the skill: Claude thinks it can handle these tasks without actually consulting the skill. So now it's going to write a more optimized description and then go through another cycle just to check.
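The optimization loop just described can be sketched roughly like this. Everything here is a simplified stand-in: `would_trigger` fakes the "did Claude load the skill for this query?" check, and the candidate descriptions are invented. But the train/test split and the keep-the-best-scoring-description loop mirror the process in the video:

```python
import random

def would_trigger(description, query):
    # Stub for "did Claude load the skill for this query?". Here a
    # description only fires when it and the query share the word "sync".
    return "sync" in description and "sync" in query

def optimize_description(labeled_queries, candidates, train_frac=0.6, max_rounds=5):
    # Split the labeled queries into train/test, score each candidate
    # description on the train set, and keep the best one.
    random.seed(0)
    queries = labeled_queries[:]
    random.shuffle(queries)
    cut = int(len(queries) * train_frac)
    train, test = queries[:cut], queries[cut:]

    def score(description, split):
        # Fraction of queries where the trigger decision matches the label.
        return sum(would_trigger(description, q) == want for q, want in split) / len(split)

    best_desc = max(candidates[:max_rounds], key=lambda d: score(d, train))
    return best_desc, score(best_desc, test)

labeled = [(f"sync the data models, task {i}", True) for i in range(5)] + \
          [(f"write some documentation, task {i}", False) for i in range(5)]
candidates = [
    "helps with model tasks",                       # too generic: never fires
    "use when syncing data models across services", # fires exactly on sync queries
]
best, test_score = optimize_description(labeled, candidates)
print(best, test_score)  # the specific description wins on the held-out queries
```

In the real loop Claude also rewrites the description between rounds based on the failures; the stub above only selects among fixed candidates, which is the part that the held-out test set keeps honest.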
And in Anthropic's blog post, they found that through this process they can have the skill trigger more reliably. But yeah, it is pretty interesting how heavily

Conclusion

Anthropic is investing into skills. And you may think after watching this video, this seems like a bunch of effort, why would I bother? But I think the reality is that if you have a skill that you're using multiple times a day, and you're going to be using it for the foreseeable future, the small amount of time that you put into making sure that, A, it's actually delivering better results, and, B, it's triggering reliably, is going to pay off in the long term. Because nowadays we're finding that some people's jobs are essentially being replaced by a couple of Claude skills. So I will be running this against some of the other skills that I have across my projects to make them better. And I would recommend running it on your own projects as well, especially for the skills that you use the most often, and for skills that you made a couple of months ago, because it may be the case that the skill is no longer required, because the model's core behavior now encodes all the functionality of that skill itself. This can also be useful when you're publishing skills, or downloading them from online, because you can actually check with the Skill Creator to make sure that it's making a difference in your codebase. Anyways, if you do want to learn more about Claude Code and everything that I'm learning, then you can join my Claude Code newsletter. It will be linked down below. I basically share anything that I'm learning, thinking about, or investigating. And by signing up, you get access to a bunch of free videos from the Master Claude Code class. If you want to upgrade and access all the videos, then you can use coupon code BIRTHDAY to get a discount, because it's Claude Code's 1-year birthday.
There are a bunch of classes covering pretty much every single feature in Claude Code, some about context engineering, some about my own daily workflows as well, like what I'm doing on a day-to-day basis, and then a bunch of bonus techniques that can make sure that you're prompting more effectively and addressing common failure patterns that you may find Claude Code and other agents falling into.
