How I Design Claude Code Skills
Over the past six months I’ve built eight Claude Code skills—session analysis tools, Docker research patterns, a full system-building methodology. Two are public (high-performance-coding, ai-session-analysis); the rest live in a private repo and are loaded as project-local or global skills depending on scope.
I didn’t set out to write eight skills. I built a series of systems—ChatPD, SkillFab, s2s, DataQuery, multi-agent-topo—and at some point noticed I was carrying the same hard-won patterns from one project to the next. Writing them down as skills was the natural endpoint of that process.
Anthropic published their Agent Skills post in October 2025, introducing a three-level progressive disclosure mechanism and framing skills as “organized folders of instructions, scripts, and resources that agents can discover and load dynamically.” It’s a strong technical specification. But there’s a gap between the spec and the practice—between understanding the format and knowing what’s worth putting in it. This post is about that gap.
Skills Are Extracted, Not Designed
The single most important thing I’ve learned: a skill is crystallized consensus from repeated experience. It comes after the builds, not before.
I didn’t design a system-building methodology in the abstract. I built ChatPD (a 270K-paper data pipeline with months of debugging), then SkillFab (an agent skill platform with its own operational surprises), then s2s (a benchmark that taught me about task design), then DataQuery (where the Skill Library pattern emerged). Only after the fourth project did I look back and notice: the same five phases kept appearing, the same failure modes kept recurring, the same checkpoint patterns kept being reinvented.
There’s a rule I’ve come to trust: don’t extract a skill until you’ve seen the pattern in at least three projects. The first time you see something, it might be a fluke. The second time, it might be coincidence. The third or fourth time, it’s a pattern. System-building was extracted after ChatPD, SkillFab, s2s, and DataQuery. High-performance-coding came from profiling and optimizing pipelines across all four. Each skill encodes something that actually went wrong, repeatedly, across multiple systems—not something that could go wrong in theory.
Anthropic’s skills post recommends the same approach from the evaluation side: “Run agents on representative tasks, identify gaps, build skills incrementally to address shortcomings.” The principle works from both directions—whether you’re extracting experience into a skill or testing a skill against real runs, the raw material is always actual failures, not anticipated ones.
How I Structure a Skill
After writing enough of these, a structure emerged. Not because I planned it, but because every skill that worked well ended up having roughly the same shape:
1. YAML frontmatter with bilingual triggers. The description field is the skill’s “if statement”: Claude uses it to decide whether to load the skill. Mine include both English and Chinese trigger phrases, because that’s how I naturally speak about each topic in Claude Code. “我要搭一个XX系统” (“I want to build an XX system”) triggers system-building. “帮我研究一下” (“help me look into this”) triggers research-gate. “断点续传” (“resumable transfer”, i.e. resuming from a checkpoint) triggers high-performance-coding.
The skill body itself stays in English because Claude Code operates in English, but the trigger conditions match my actual speech patterns.
2. A short “what this is for” section. Two or three sentences. Not a manifesto. Enough to tell Claude (and future me) what problem this skill solves and when to reach for it.
3. The anti-patterns table. This is the most-read section of any skill. Nobody reads long-form guidance in the middle of work—they scan. The table is the scannable version of the entire skill.
Each entry has three columns: a name (memorable and specific: “Print-based logging,” not “inadequate observability”), what it looks like (a realistic thought someone would actually have: “It’s just a prototype, I’ll add tests later”), and what to do instead (concrete, with cross-references: “Phase 1 gates on tests. ‘Later’ never comes.”).
Five to ten entries. Each one something I’ve actually caught myself doing. This constraint is important—if an anti-pattern hasn’t happened to me personally, I don’t know enough about it to write a good entry.
4. The main guidance. This is where the skill earns its tokens. Specific, grounded in real code, organized by decision point rather than by topic. Not “here is everything about performance optimization”—that’s a textbook. “When you notice X, reach for Y first” is what a skill should say.
5. References to bundled resources. Larger skills split detailed material into separate files loaded on demand: reference docs, code templates, checklists. Anthropic’s progressive disclosure design makes this practical—Claude reads the skill body on match, then navigates into sub-files only when needed.
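The five-part shape above can be sketched as a small renderer. This is a minimal illustration, not the layout of any of my actual skills: the section headings, the `render_skill` function, and the `AntiPattern` class are names made up for the example, and the only part taken from the Agent Skills format is the frontmatter with `name` and `description`.

```python
import json
from dataclasses import dataclass


@dataclass
class AntiPattern:
    name: str        # memorable and specific
    looks_like: str  # a realistic thought someone would actually have
    do_instead: str  # concrete replacement, ideally with a cross-reference


def render_skill(name, description, purpose, anti_patterns, guidance, resources):
    """Render a SKILL.md-style document in the five-part shape (hypothetical headings)."""
    lines = [
        "---",
        f"name: {name}",
        # json.dumps yields a double-quoted string, which is also a valid YAML
        # scalar, so colons or quotes in the trigger phrases cannot break parsing.
        f"description: {json.dumps(description, ensure_ascii=False)}",
        "---",
        "",
        "## What this is for",
        purpose,
        "",
        "## Anti-patterns",
        "| Name | What it looks like | What to do instead |",
        "|---|---|---|",
    ]
    for ap in anti_patterns:
        lines.append(f"| {ap.name} | {ap.looks_like} | {ap.do_instead} |")
    lines += ["", "## Guidance", guidance, "", "## Resources"]
    lines += [f"- {path}" for path in resources]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_skill(
        name="system-building",
        description="Five-phase methodology for building systems. "
                    "Use when the user says 我要搭一个XX系统 or asks to build a platform.",
        purpose="Keeps new systems testable and observable from day one.",
        anti_patterns=[
            # "Tests later" is a hypothetical entry name for illustration.
            AntiPattern("Tests later",
                        "It's just a prototype, I'll add tests later",
                        "Phase 1 gates on tests. 'Later' never comes."),
        ],
        guidance="When you notice X, reach for Y first.",
        resources=["reference/checklist.md"],  # hypothetical bundled file
    ))
```

Putting the anti-patterns table directly after the purpose section keeps the scannable part near the top, where both a person and the agent will reach it first.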
Gates at the Right Places
Some decisions are too expensive to get wrong. For those, the skill blocks progress until a checklist completes.
What makes a decision worth gating? Three tests, discovered by getting all three wrong at different times:
- The cost of being wrong is high. Scaling to 5000 units without validating on 50 burned days of compute on an early ChatPD run.
- The fix is cheap early, expensive late. Adding structured logging on day one is trivial. Retrofitting it across 20 modules after the fact took an entire afternoon.
- The decision gets skipped under time pressure. “I’ll add tests later” was the most reliable predictor of never adding tests, across every project I’ve built.
But gates have to be rare, and this matters. If every section is a gate, the agent learns to treat all of them as optional. In system-building, four of five phases have gates. In research-gate, five checkpoints guard four phase transitions. In both, each gate sits at a genuinely irreversible decision point, and there are few enough of them that the agent takes each one seriously.
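In a skill, a gate is written as instructions to the agent rather than as code, but the behaviour it asks for can be modelled in a few lines. The sketch below is only that model: `Gate`, `GateBlocked`, and the checklist items are hypothetical names chosen for illustration.

```python
class GateBlocked(Exception):
    """Raised when a phase transition is attempted with an unfinished checklist."""


class Gate:
    def __init__(self, name, checklist):
        self.name = name
        # Every item starts unchecked; the gate opens only when all are done.
        self.items = {item: False for item in checklist}

    def check_off(self, item):
        if item not in self.items:
            raise KeyError(f"unknown checklist item: {item}")
        self.items[item] = True

    def pending(self):
        """Checklist items that have not been completed yet, in original order."""
        return [item for item, done in self.items.items() if not done]

    def require(self):
        """Refuse to proceed until every checklist item is complete."""
        missing = self.pending()
        if missing:
            raise GateBlocked(f"{self.name}: incomplete: {', '.join(missing)}")


if __name__ == "__main__":
    gate = Gate("scale-up", ["validated on 50 units", "structured logging in place"])
    gate.check_off("validated on 50 units")
    try:
        gate.require()
    except GateBlocked as err:
        print(err)
```

The point of raising instead of returning a flag is the same as in the prose version: the default path is blocked, and proceeding takes an explicit act.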
Anthropic’s harness design post describes a related pattern they call the Sprint Contract: before writing code, the generator and evaluator negotiate exactly what “done” looks like. A gate is a Sprint Contract with teeth—it doesn’t just record expectations, it refuses to proceed until they’re met.
Gates Produce Audit Trails
Every gate I design requires producing structured output. Not a form to fill out—a compact artifact that proves the gate was actually engaged with:
Phase 3 complete.
Bottleneck: serial regex matching across 270K records
Fix: compiled regex + chunked parallel processing
Speedup: 23x
Tests passing: all 47
Two purposes. First, it forces real thinking—you can’t produce this output without having actually profiled and optimized. Second, it creates a record. Six months later, when I’m wondering why a particular architecture decision was made, the gate outputs are there, written at the moment of decision, not reconstructed from memory.
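If the gate outputs should also be machine-readable, one option is to keep each one as a small record that renders to the compact text form and is appended to a log file. This is a sketch under assumptions: the `GateRecord` fields mirror the example output above, while the JSON Lines log and the `recorded_at` timestamp are additions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class GateRecord:
    phase: int
    bottleneck: str
    fix: str
    speedup: str
    tests_passing: str

    def render(self):
        """Render the compact text artifact the gate asks for."""
        return "\n".join([
            f"Phase {self.phase} complete.",
            f"Bottleneck: {self.bottleneck}",
            f"Fix: {self.fix}",
            f"Speedup: {self.speedup}",
            f"Tests passing: {self.tests_passing}",
        ])


def append_record(record, path):
    """Append the record as one JSON line, stamped at the moment of decision."""
    entry = asdict(record)
    entry["recorded_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    record = GateRecord(
        phase=3,
        bottleneck="serial regex matching across 270K records",
        fix="compiled regex + chunked parallel processing",
        speedup="23x",
        tests_passing="all 47",
    )
    print(record.render())
    append_record(record, "gate_log.jsonl")  # hypothetical log location
```

Appending rather than overwriting is what makes the file an audit trail: earlier decisions stay exactly as they were written.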
This is the same instinct behind DataQuery’s plan.json and s2s’s Sprint Contract verification. Before acting, write down what you expect. After acting, compare. The artifact bridges the gap between intention and outcome.
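The expect-then-compare step is small enough to sketch directly. The function and the keys below are hypothetical and do not reflect the actual format of DataQuery’s plan.json; they only show the shape of the comparison.

```python
def compare(expected, actual):
    """Return {key: (expected, actual)} for every expectation that was not met."""
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    # Written down before acting.
    expected = {"records_processed": 50, "tests_failed": 0}
    # Measured after acting.
    actual = {"records_processed": 50, "tests_failed": 2}
    print(compare(expected, actual))
```

An empty result means the outcome matched the plan; anything else is the list of surprises to explain before moving on.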
Self-Contained Over Orchestrated
When given the choice between a skill that delegates to other skills and one that stands alone, I choose standalone. Always.
The reason is boring but empirically true across every skill I’ve built: delegation chains break silently. If skill A triggers skill B which triggers skill C, and B’s conditions don’t match the current context, the chain fails with no error message. The agent just doesn’t do the thing.
Cross-reference is fine: “optionally use the ai-session-analysis skill to automate this step.” But a skill should work without depending on other skills.
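The silent failure is easier to see in a toy model. This is not how Claude Code actually resolves skills; the keyword matching, the `run_chain` function, and the skill names are all assumptions made to illustrate the failure mode.

```python
def matches(trigger_words, context):
    """Toy trigger check: does any trigger word appear in the context?"""
    return any(word in context for word in trigger_words)


def run_chain(chain, context):
    """Toy model of A -> B -> C delegation: stop at the first link whose triggers miss."""
    completed = []
    for name, trigger_words in chain:
        if not matches(trigger_words, context):
            # No exception and no log line: the rest of the chain never runs.
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    chain = [("A", ["optimize"]), ("B", ["docker"]), ("C", ["optimize"])]
    print(run_chain(chain, "optimize the parser"))
```

In this run, C would have matched on its own, but it is never reached because B did not match, and nothing reports that. A standalone skill has only one trigger condition to get right.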
What I’d Do Differently
Looking back at eight skills, a few patterns I wish I’d adopted earlier:
Start with a shorter description field. My early skills had description fields that tried to list every trigger condition. Claude is better at semantic matching than I gave it credit for—a concise description with 2–3 representative trigger phrases works better than an exhaustive list.
Anti-patterns first. In early skills, I buried the anti-patterns table in the middle. In later ones, it’s always the first substantive section after the intro. When I’m mid-task and scanning, that’s what I want to see. When Claude is mid-task and scanning, same thing.
Prune after model upgrades. Anthropic’s harness design principle—”every component encodes an assumption about what the model can’t do on its own”—applies to skills too. Some guidance that was essential for Claude 3.5 Sonnet became dead weight for Opus 4.5. After every major model release, I now re-read each skill and ask: does Claude still need to be told this?
When Not to Write a Skill
Not everything deserves to be a skill. I’ve held off on writing one when:
- The pattern hasn’t appeared across multiple projects. One project is an anecdote.
- The guidance is obvious to anyone with basic competence in the domain.
- I’m still figuring out the pattern myself. Let it stabilize.
- It would work better as a reference document or code template—a skill should encode methodology, not just information.
The bar is: can I point to a specific session or project where not having this skill caused a measurable failure? If I can’t, the skill probably isn’t ready.
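The checks above, together with the three-project rule from earlier, can be written as a short pre-flight function. It is a sketch only: `SkillCandidate`, its field names, and `reasons_to_wait` are hypothetical, and the judgement behind each field still has to be made by a person.

```python
from dataclasses import dataclass, field


@dataclass
class SkillCandidate:
    name: str
    projects: set = field(default_factory=set)  # projects where the pattern appeared
    obvious: bool = False         # obvious to anyone competent in the domain?
    still_changing: bool = False  # is the pattern still being worked out?
    methodology: bool = True      # False if it is really reference material
    failure_evidence: str = ""    # session or project where lacking it caused a failure


def reasons_to_wait(candidate, min_projects=3):
    """Return the reasons this candidate is not ready to become a skill."""
    reasons = []
    if len(candidate.projects) < min_projects:
        reasons.append(
            f"seen in {len(candidate.projects)} project(s), want {min_projects}")
    if candidate.obvious:
        reasons.append("guidance is obvious to a competent practitioner")
    if candidate.still_changing:
        reasons.append("pattern has not stabilized")
    if not candidate.methodology:
        reasons.append("better as a reference document or code template")
    if not candidate.failure_evidence:
        reasons.append("no specific failure to point to")
    return reasons


if __name__ == "__main__":
    idea = SkillCandidate("retry-policies", projects={"ChatPD"})
    for reason in reasons_to_wait(idea):
        print("wait:", reason)
```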
The Current Set
| Skill | Type | What triggers it | What it actually does |
|---|---|---|---|
| high-performance-coding | Global | “optimize”, “断点续传” (“resumable transfer”), “make it faster” | Bottleneck-driven performance methodology distilled from profiling ChatPD, s2s, and DataQuery pipelines |
| ai-session-analysis | Global | “analyze my coding sessions”, “what tools do I use most” | Zero-dependency Python scripts extracting tool usage patterns from Claude Code, Codex, and Kimi Code sessions |
| system-building | Global | “我要搭一个XX系统” (“I want to build an XX system”), “帮我构建一个平台” (“help me build a platform”) | Five-phase methodology: scaffolding with tests, test-first core, profiling-driven optimization, staged scale-up, retrospective |
| docker-research | Global | Docker, containers, docker-compose | Sandbox lifecycle patterns from s2s and DataQuery: build once, exec many; base64 command encoding; snapshot/restore |
| research-gate | Global | “帮我研究一下” (“help me look into this”), “can we investigate”, “I want to study” | Five checkpoints before committing compute: problem definition, scope validation, method selection, resource estimation, success criteria |
| session-audit | Global | “audit my sessions”, “会话审计” (“session audit”), “分析我的会话质量” (“analyze the quality of my sessions”) | Seven semantic failure patterns extracted from analyzing 282 Claude Code sessions |
| software-to-skill | Project-local | “add a new software target” | CLI software → AI competency benchmark pipeline (s2s-specific) |
| curate-new-datasets | Project-local | “find new datasets” | ChatPD dataset discovery and deduplication pipeline |
The Anthropic posts that shaped this thinking: Equipping Agents for the Real World with Agent Skills (October 2025), Building Effective Agents (December 2024), and Harness Design for Long-Running Application Development (May 2026).
