How to Build AI Agents: Lessons from Five Projects
Over the past year, I’ve built a series of AI agent systems. A multi-agent pipeline that fixes bugs in real codebases. A federated query engine that audits whether ML datasets are actually downloadable. A benchmark that measures how well agents can use unfamiliar command-line tools. A full research-agent operating system with memory consolidation and self-evolution.
Across these projects, one question came up again and again: how much should I, the human builder, specify up front, and how much should I let the agent figure out?
Anthropic’s engineering team has been publishing some of the best thinking on this question. Their series of posts, from the foundational Building Effective Agents (December 2024) through to the recent Harness Design for Long-Running Application Development (May 2026), forms what I think is the most coherent engineering philosophy for agent builders right now.
This post is my attempt to synthesize what I learned from their writing with what I learned from building. The projects I’ll draw on are DataQuery (dataset accessibility auditing), multi-agent-topo (SWE-bench bug-fixing pipeline), s2s (agent capability benchmarking), OmniScientist/Helixforge (research agent OS), and ChatPD (paper-to-dataset extraction pipeline).
The Core Framework: Agent = Model + Harness
The single most useful framing comes from Anthropic’s harness design post: an agent is the model plus the harness around it. The harness is everything you build around the model—prompts, tools, context policies, sandboxes, feedback loops, recovery paths.
“A decent model with a great harness beats a great model with a bad harness.” — Harness Design for Long-Running Apps
Every component in your harness encodes an assumption about what the model can’t do on its own. Those assumptions go stale as models improve. Part of your job is to keep re-testing them.
When I first built the multi-agent-topo pipeline, my harness assumed the model needed a six-step workflow written into the system prompt: ls the directory, grep for keywords, cat the file, trace the logic, sed -i to edit, git diff to verify. This was me encoding my own debugging habits as a mandatory script. The model followed orders—even when a different approach would have been faster. It was, as Anthropic puts it, “grown more than it was built”—and I had over-built the wrong parts.
The rest of this post walks through the design principles I arrived at, organized around one fundamental idea: find the simplest thing that works, then only add complexity when you can measure the improvement.
Principle 1: Start with a Single Call, Not an Agent
Anthropic’s opening recommendation in Building Effective Agents is unambiguous:
“Start by using LLM APIs directly. Many patterns can be implemented in a few lines of code.”
They introduce a distinction between workflows (LLMs orchestrated through predefined code paths) and agents (LLMs that dynamically direct their own process and tool usage). The advice: begin with a single, well-crafted LLM call. Add retrieval and in-context examples. Only graduate to multi-step agentic patterns when a simpler approach demonstrably fails.
In my projects, I’ve violated this principle more often than I’d like to admit. The multi-agent-topo pipeline started as an elaborate five-stage architecture (analysis agent, localization agent, challenge node, repair agent, verification) before I had validated whether a single interactive agent loop with bash + submit tools could outperform the whole thing. When I finally ran the baseline experiment, the standalone mode (single agent with tools, no pipeline) achieved a 73% submit rate in 40.4 turns on average, while the challenge-injected pipeline reached only a 50% submit rate with 50 turns.
The lesson: before you build a pipeline, prove that a single agent can’t do the job.
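A minimal sketch of that baseline comparison, assuming each run is logged as a record with a `submitted` flag and a `turns` count. The field names, the `min_gain` threshold, and the sample records are all hypothetical; the records are made up for illustration and are not my experimental data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One agent run. Field names are illustrative assumptions."""
    submitted: bool  # did the agent call its submit tool?
    turns: int       # number of model turns used


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Return the submit rate and average turn count for a set of runs."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    return {
        "submit_rate": sum(r.submitted for r in runs) / len(runs),
        "avg_turns": mean(r.turns for r in runs),
    }


def pipeline_is_justified(baseline: list[RunRecord],
                          pipeline: list[RunRecord],
                          min_gain: float = 0.05) -> bool:
    """Keep the pipeline only if it beats the single-agent baseline by at
    least `min_gain` in submit rate (the threshold is an assumption)."""
    gain = summarize(pipeline)["submit_rate"] - summarize(baseline)["submit_rate"]
    return gain >= min_gain


# Made-up sample data, for illustration only.
baseline_runs = [RunRecord(True, 38), RunRecord(True, 41),
                 RunRecord(False, 50), RunRecord(True, 35)]
pipeline_runs = [RunRecord(True, 50), RunRecord(False, 50),
                 RunRecord(False, 50), RunRecord(True, 50)]

print(summarize(baseline_runs))
print(pipeline_is_justified(baseline_runs, pipeline_runs))
```

The point of writing the check down is that the pipeline has to earn its place against a number, not against my intuition about what a good architecture looks like.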
Principle 2: Prompt the Goal, Not the Recipe
This is the single change that had the largest impact across my projects.
Here is what my early agent prompt looked like:
1. Run ls /testbed to find relevant directories
2. Use grep -r "keyword" /testbed/ to search
3. Read files with cat /testbed/path/to/file.py
4. Trace the code logic
5. Edit with sed -i
6. Run git diff and submit
And here is the DataQuery auditor prompt, after I learned better:
Your task: determine whether an academic researcher can actually obtain
this data. The URL is a starting point, not the only path. Use your tools
and judgment.
The first version turns the agent into an executor. The second gives it a goal and trusts it to navigate. As Anthropic puts it in their context engineering post, Effective Context Engineering for AI Agents, every token depletes the model’s “attention budget”, so make each one count. A numbered workflow burns attention budget on instructions the agent doesn’t need, while crowding out the context it actually needs to reason about the problem.
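One way to keep yourself honest here is to build prompts from a goal and a definition of done, with no slot for steps at all. The sketch below is an illustration only: `build_goal_prompt`, its parameters, and the section wording are hypothetical, not the DataQuery code.

```python
from typing import Optional


def build_goal_prompt(goal: str,
                      done_when: str,
                      hard_limits: Optional[list[str]] = None) -> str:
    """Assemble a prompt that states the target, the definition of done, and
    any hard limits. There is deliberately no parameter for steps."""
    sections = [
        f"Your task: {goal.strip()}",
        f"You are done when: {done_when.strip()}",
    ]
    if hard_limits:
        limits = "\n".join(f"- {limit}" for limit in hard_limits)
        sections.append(f"Hard limits:\n{limits}")
    sections.append("Use your tools and judgment.")
    return "\n\n".join(sections)


prompt = build_goal_prompt(
    goal="determine whether an academic researcher can actually obtain this data",
    done_when="you have either downloaded the data or documented what blocks access",
    hard_limits=["do not pay for access"],  # hypothetical constraint
)
print(prompt)
```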
Principle 3: Separate Execution from Verification
This is the architectural decision I’m most confident about, and it aligns directly with the Evaluator-Optimizer pattern Anthropic describes across multiple posts.
“Agents tend to respond by confidently praising their own work even when it is, to a human reviewer, mediocre at best. Separating the judging agent from the doing agent proved powerful.” — Harness Design for Long-Running Apps
In DataQuery, the Auditor agent tries to access a dataset. The Verifier agent inspects the downloaded files. They are two completely independent Claude Code sessions. The Verifier never sees the Auditor’s report. They share only raw artifacts—files on disk.
In our multi-agent-topo experiments, when we let the same agent re-examine its own output (the “self-reflection” pattern), the submission rate dropped from 73% to 50%. The agents became more hesitant after being challenged, spent more turns re-reading the same files, but did not produce better patches. Self-evaluation created the illusion of rigor without the substance.
A second lesson from Anthropic’s harness post: the evaluator needs to be tuned to be skeptical. Not aggressive, just skeptical. Their evaluator scored work against four graded criteria (design quality, originality, craft, functionality) instead of asking “is this good?”, a question that reliably produced rubber-stamp approvals. In DataQuery, we achieved the same effect with a V1–V4 verification taxonomy: file existence → format validation → content integrity → total size. Each check is a binary pass/fail, leaving no room for “looks fine to me.”
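To make the V1–V4 idea concrete, here is a small sketch of an ordered chain of binary checks over downloaded files. Only the four check names come from the taxonomy above. The check bodies, the allowed suffixes, the size threshold, and the choice to stop at the first failure are my assumptions for illustration, not DataQuery’s actual rules.

```python
from dataclasses import dataclass
from pathlib import Path

# Thresholds and allowed suffixes are illustrative assumptions.
ALLOWED_SUFFIXES = {".csv", ".json", ".parquet", ".zip", ".gz"}
MIN_TOTAL_BYTES = 1024


@dataclass(frozen=True)
class CheckResult:
    code: str     # "V1" .. "V4"
    passed: bool  # strictly pass/fail, no "looks fine to me"


def v1_files_exist(files: list[Path]) -> bool:
    return bool(files) and all(f.is_file() for f in files)


def v2_format_valid(files: list[Path]) -> bool:
    return all(f.suffix.lower() in ALLOWED_SUFFIXES for f in files)


def v3_content_intact(files: list[Path]) -> bool:
    # Stand-in integrity check: no file may be empty.
    return all(f.stat().st_size > 0 for f in files)


def v4_total_size_ok(files: list[Path]) -> bool:
    return sum(f.stat().st_size for f in files) >= MIN_TOTAL_BYTES


CHECKS = [("V1", v1_files_exist), ("V2", v2_format_valid),
          ("V3", v3_content_intact), ("V4", v4_total_size_ok)]


def verify(files: list[Path]) -> list[CheckResult]:
    """Run V1 to V4 in order, stopping at the first failure."""
    results = []
    for code, check in CHECKS:
        passed = check(files)
        results.append(CheckResult(code, passed))
        if not passed:
            break
    return results
```

Stopping early keeps later checks from running against files that are already known to be missing, and the last entry in the result list always names the check that decided the outcome.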
Principle 4: The Skill Library—Learn from Trajectories
Anthropic’s Agent Skills post introduces a three-level progressive disclosure mechanism: metadata loaded at startup, full instructions loaded on task match, and bundled resources loaded on demand. Skills transform a general-purpose agent into a specialized one by packaging domain knowledge into composable, discoverable units.
DataQuery’s Skill Library takes this idea one step further: the skills are automatically extracted from the agent’s own successful trajectories.
Here is the loop:
- Extract: After a successful dataset access, another LLM call analyzes the trajectory and extracts a reusable pattern. Crucially, we don’t extract a script (“step 1: curl X, step 2: wget Y”). We extract a cognitive model: “GitHub repos are often just pointers; the actual data lives in Releases or on mirror sites like HuggingFace. Check there first.”
- Consolidate: When 5–10 trajectories from the same platform accumulate, a merge pass generalizes them. The prompt explicitly says: “Do NOT write a checklist of curl/wget commands. Instead, teach a future agent HOW TO THINK about this platform.”
- Select: On a new query, the system automatically matches the top-N highest-success-rate skills for the target platform. Skills with persistently low success rates are automatically deprecated, not by human judgment but by statistical evidence.
- Bootstrap: The initial skills are human-written decision trees with concrete evidence citations. Once the system is running, automated extraction takes over.
Anthropic notes that a future direction is enabling agents to “create, edit, and evaluate Skills on their own.” DataQuery’s extraction pipeline is a working prototype of exactly that.
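The Select step is the easiest part of the loop to show in code. This is a sketch under assumptions: the `SkillStats` fields, the `min_attempts` and `min_rate` cutoffs, and the example insights are hypothetical, and the real system may rank and deprecate differently.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    platform: str
    insight: str        # a cognitive model of the platform, not a script
    successes: int = 0
    attempts: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


def is_deprecated(skill: SkillStats, min_attempts: int = 5,
                  min_rate: float = 0.3) -> bool:
    """A skill is deprecated only once there is enough evidence against it."""
    return skill.attempts >= min_attempts and skill.success_rate < min_rate


def select_skills(library: list[SkillStats], platform: str,
                  top_n: int = 3) -> list[SkillStats]:
    """Return up to `top_n` non-deprecated skills for `platform`,
    highest success rate first."""
    candidates = [s for s in library
                  if s.platform == platform and not is_deprecated(s)]
    return sorted(candidates, key=lambda s: s.success_rate, reverse=True)[:top_n]


# Made-up library entries, for illustration only.
library = [
    SkillStats("github", "Check Releases and mirrors first", 8, 10),
    SkillStats("github", "Clone the repo and look for a data/ folder", 1, 10),
    SkillStats("github", "Follow links in the README", 3, 4),
    SkillStats("huggingface", "Look at the dataset card", 9, 10),
]
print([s.insight for s in select_skills(library, "github", top_n=2)])
```

Requiring a minimum number of attempts before deprecating keeps a new skill from being discarded after one unlucky run.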
Principle 5: Schema Over Script
One of the subtler but most important distinctions: a schema describes what the output should look like. A script prescribes what the agent should do.
In DataQuery, the barrier taxonomy classifies access outcomes along three dimensions: R (reachability: R0–R2), I (interface: I0–I4), A (accessibility: A0–A4). It says: “When you report results, classify what happened using these dimensions.” It does not say: “If you get a 404, then use wget with --mirror.”
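A schema like this can be enforced in a few lines without saying anything about what the agent should do. The sketch below validates only the level ranges given above. The class name, the compact `R1 I3 A0` string format, and the `parse` helper are assumptions for illustration, and the meaning of each level is deliberately not modelled.

```python
import re
from dataclasses import dataclass

# Highest valid level per dimension, taken from the ranges in the taxonomy.
MAX_LEVEL = {"R": 2, "I": 4, "A": 4}


@dataclass(frozen=True)
class BarrierReport:
    reachability: int   # R0 to R2
    interface: int      # I0 to I4
    accessibility: int  # A0 to A4

    def __post_init__(self) -> None:
        levels = {"R": self.reachability, "I": self.interface,
                  "A": self.accessibility}
        for dim, value in levels.items():
            if not 0 <= value <= MAX_LEVEL[dim]:
                raise ValueError(f"{dim}{value} is outside {dim}0..{dim}{MAX_LEVEL[dim]}")

    @classmethod
    def parse(cls, text: str) -> "BarrierReport":
        """Parse a compact code such as 'R1 I3 A0' (hypothetical format)."""
        match = re.fullmatch(r"R(\d)\s+I(\d)\s+A(\d)", text.strip())
        if match is None:
            raise ValueError(f"not a barrier code: {text!r}")
        return cls(*(int(group) for group in match.groups()))


print(BarrierReport.parse("R1 I3 A0"))
```

The schema rejects malformed reports, but it has no opinion on which tool the agent used to arrive at them.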
In earlier projects, I blurred this line. The MARC-DSL for robot coordination started as a communication schema—a vocabulary for expressing plans. Over time, it hardened into a script: robots couldn’t express intentions the DSL didn’t cover.
Anthropic’s skills design follows the same principle. A SKILL.md provides methodology and reference material—not prescribed steps. The skill says “here’s what we know about this domain and how to reason about it.” The agent decides what to do with that knowledge.
“Give Claude necessary information but flexibility to adapt.” — Skill design tip from Anthropic’s March 2026 update
Principle 6: The Sprint Contract
Both s2s and DataQuery independently arrived at a pattern Anthropic formalizes as the Sprint Contract:
“Before each sprint the generator and evaluator negotiate a sprint contract: agreeing on what ‘done’ looks like for that chunk of work before any code was written.” — Harness Design for Long-Running Apps
In DataQuery, the agent writes plan.json before acting: what approach it will try first, what barrier codes it expects to find. After execution, the system compares expectations to reality. A mismatch is recorded but doesn’t trigger a retry—it’s purely observational. Over time, systematic gaps (agent consistently expects R0 but gets R1) reveal that its cognitive model of a platform is wrong and needs updating.
This is different from runtime intervention—like multi-agent-topo’s mid-session challenge injection. The Sprint Contract doesn’t interrupt. It observes the gap between expectation and reality, then feeds that gap back into the learning loop for the next run.
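Here is a sketch of that observe-only comparison. The `expected_barriers` key, the shape of the outcome dict, and the `min_count` cutoff are hypothetical; the source only establishes that plan.json holds the intended approach and the expected barrier codes.

```python
import json
from collections import Counter


def load_plan(path: str) -> dict:
    """Read the plan the agent wrote before acting."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def diff_plan(plan: dict, outcome: dict) -> dict:
    """Return {dimension: (expected, actual)} for every expectation that
    did not match what actually happened."""
    expected = plan.get("expected_barriers", {})
    return {dim: (exp, outcome.get(dim))
            for dim, exp in expected.items()
            if outcome.get(dim) != exp}


def record_gap(log: list, run_id: str, plan: dict, outcome: dict) -> dict:
    """Append the gap to an observation log. No retry, no interruption."""
    gap = diff_plan(plan, outcome)
    log.append({"run": run_id, "gap": gap})
    return gap


def systematic_gaps(log: list, min_count: int = 3) -> dict:
    """Return mismatches that recur at least `min_count` times."""
    counts = Counter((dim, exp, act)
                     for entry in log
                     for dim, (exp, act) in entry["gap"].items())
    return {key: n for key, n in counts.items() if n >= min_count}


# Made-up plan and outcome, for illustration only.
plan = {"approach": "direct download", "expected_barriers": {"R": "R0", "I": "I1"}}
outcome = {"R": "R1", "I": "I1", "A": "A0"}
print(diff_plan(plan, outcome))
```

A single mismatch is noise. The same mismatch across several runs on one platform is the signal that the agent’s model of that platform needs updating.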
Principle 7: Deterministic Where Possible, LLM Where Necessary
Anthropic’s post on harnessing Claude’s intelligence makes a pointed recommendation: “Use declarative tools for UX, observability, or security.” Promote actions to dedicated tools with typed arguments when you need interception, gating, rendering, or auditing. For everything else, let Claude use what it already knows—bash and a text editor.
DataQuery’s verification pipeline follows the same split:
- Rust layer: Checks file existence, validates magic bytes (\x89PNG, PK\x03\x04, \x1f\x8b), detects HTML error pages masquerading as data files, verifies minimum download size. Deterministic, zero API cost, runs in milliseconds.
- LLM layer: A separate Verifier session judges whether the downloaded files “look like actual data.” One prompt handles qualitative judgment that can’t be reduced to if-statements.
The principle: never pay API tokens for something a deterministic check can do. Conversely, never hardcode thirty if-statements for something an LLM can judge in one prompt.
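The deterministic half is simple enough to sketch. This is a Python illustration of the idea, not the Rust layer itself: the three magic-byte prefixes come from the list above (PNG, ZIP, gzip), while the function names, the HTML heuristic, the failure labels, and the 1024-byte minimum are assumptions.

```python
from typing import Optional

# Magic-byte prefixes named in the text: PNG, ZIP, gzip.
MAGIC = {
    b"\x89PNG": "png",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}


def sniff_format(head: bytes) -> Optional[str]:
    """Identify a file from its leading bytes, or return None if unknown."""
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def looks_like_html(head: bytes) -> bool:
    """Heuristic for an HTML error page saved under a data file's name."""
    start = head.lstrip().lower()
    return start.startswith((b"<!doctype html", b"<html"))


def deterministic_check(data: bytes, expected: str,
                        min_bytes: int = 1024) -> list[str]:
    """Return the list of failure reasons; an empty list means pass."""
    failures = []
    if len(data) < min_bytes:
        failures.append("too_small")
    if looks_like_html(data[:512]):
        failures.append("html_error_page")
    if sniff_format(data[:8]) != expected:
        failures.append("magic_mismatch")
    return failures


print(deterministic_check(b"<html>Not Found</html>", "zip"))
```

Anything that passes these checks is still only a candidate. Whether the contents look like real data is the question left for the Verifier session.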
Principle 8: Ask “What Can I Stop Doing?”
This is, to me, the most counterintuitive and important idea in Anthropic’s engineering philosophy:
“Every component in your harness encodes an assumption about what the model can’t do on its own. Those assumptions go stale as models improve.” — Harnessing Claude’s Intelligence
Their concrete example: Claude Sonnet 4.5 exhibited “context anxiety”—rushing to finish when nearing the context limit. They built a context-reset mechanism to compensate. Claude Opus 4.5 largely eliminated that behavior on its own. The reset mechanism became dead weight. They removed it.
In my own projects, the multi-agent-topo pipeline still carries components I added for earlier, weaker models. The six-step workflow in the system prompt. The _is_stuck() function with its hand-tuned thresholds (5, 3, 15, 10, 6, 2). The challenge injection triggered by confidence < 0.7. Each of these was a reasonable response to a model limitation at the time. Many of them are now dead weight.
The discipline: after every model upgrade, re-baseline with the simplest possible harness. Remove anything that no longer moves the needle. Anthropic’s framing is that “the harness doesn’t shrink, it moves”: as old assumptions become obsolete, new ceilings unlock new scaffolding needs. Even so, for scaffolding built around old limitations, the direction of travel is toward less, not more.
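The pruning pass can be framed as a leave-one-out ablation. In the sketch below, `evaluate` stands for whatever benchmark run you already have, and `fake_evaluate` is a stub with made-up scores so the example runs. The component names echo the ones above, but the numbers and the `min_gain` cutoff are purely illustrative.

```python
from typing import Callable


def prune_candidates(components: list[str],
                     evaluate: Callable[[frozenset], float],
                     min_gain: float = 0.02) -> list[str]:
    """Return the components whose removal costs less than `min_gain`.

    `evaluate` takes the set of enabled components and returns a score
    where higher is better (for example, a submit rate)."""
    full = frozenset(components)
    full_score = evaluate(full)
    removable = []
    for component in components:
        score_without = evaluate(full - {component})
        if full_score - score_without < min_gain:
            removable.append(component)
    return removable


def fake_evaluate(enabled: frozenset) -> float:
    """Stub benchmark with made-up numbers, for illustration only."""
    score = 0.70
    if "challenge_injection" in enabled:
        score -= 0.10  # pretend this component now hurts
    if "is_stuck" in enabled:
        score += 0.05  # pretend this one still helps
    return score


harness = ["six_step_workflow", "is_stuck", "challenge_injection"]
print(prune_candidates(harness, fake_evaluate))
```

Leave-one-out misses interactions between components, so I treat the output as a list of things to investigate, not a list of things to delete blindly.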
The Five-Layer Framework
Bringing all of this together, here is how I think about what the human specifies versus what the agent decides:
| Layer | What It Is | Human Role | Agent Role |
|---|---|---|---|
| 0: Hard Constraints | Safety bounds, cost limits, infrastructure | Set absolute limits | Cannot override |
| 1: Goal & Success Criteria | What to achieve, what “done” means | Define target and verification | How to get there |
| 2: Exploration Space | Available tools and capabilities | Provide composable tools | Choose what, when, how |
| 3: Self-Awareness | Agent’s ability to assess its own state | Define what self-reflection looks like (schema) | Judge progress, confidence, completion |
| 4: Self-Evolution | Learning from experience | Bootstrap initial knowledge, design the loop | Extract patterns, update models |
Most agent projects—including my own early ones—live at Layer 2, with humans inadvertently making Layer 3 decisions through hardcoded thresholds. The meta-skill of agent building is knowing, for each design decision, which layer it belongs in and whether you’ve placed it too low.
Where to Start
If I were starting a new agent project tomorrow, here’s what I’d do, ordered by impact-to-effort:
1. Write the goal, not the steps. Delete every numbered list from your system prompt. Replace it with a target description and a definition of done.
2. Run a baseline with a single agent call before building a pipeline. In my experience, the single agent usually beats the pipeline on first attempt. Only add multi-step orchestration when you can measure the gap.
3. Add an independent evaluator. One session that sees only raw outputs, not the executor’s reasoning. Make it pass/fail.
4. Add a plan artifact. Before execution, write expectations to a file. After execution, diff expectation against reality. Don’t interrupt; just record. Feed the gaps back into the next run.
5. Start a skill library. Even a simple JSON file mapping platform → successful patterns. Feed it as context to future runs.
6. After every model upgrade, prune. Re-run the baseline without your harness components. Delete anything that no longer moves the needle.
The projects this methodology is drawn from: DataQuery (federated dataset accessibility auditing), multi-agent-topo (multi-agent SWE-bench pipeline), s2s (agent capability benchmarking), OmniScientist/Helixforge (research agent OS), and ChatPD (paper-to-dataset extraction).
The Anthropic posts that shaped this thinking: Building Effective Agents, Effective Context Engineering for AI Agents, Equipping Agents for the Real World with Agent Skills, Harnessing Claude’s Intelligence, and Harness Design for Long-Running Application Development.
Cite This Post
@misc{xu2026agent-methodology,
author = {Anjie Xu},
title = {How to Build {AI} Agents: Lessons from Five Projects},
year = {2026},
month = jun,
howpublished = {\url{https://anjiexu-pku.github.io/tech/ai/building-ai-agents-methodology/}},
}
