Design Dimensions for Research Infrastructure

Over the past few years I’ve built several pieces of research infrastructure. ChatPD—a data pipeline processing hundreds of thousands of arXiv papers. SkillFab—a platform for creating, reviewing, and publishing agent skills. multi-agent-topo—an experimental framework comparing multi-agent topologies across 500 SWE-bench instances. s2s—a tool that lets Claude Code autonomously explore open-source software and generate competency benchmarks.

These projects differ in scale, language, and audience. But building them surfaced the same set of problems: how do you keep a pipeline running after it crashes at hour 14? How do you know if 3% of your records silently corrupted? How do you max out concurrency without getting rate-limited into the ground?

Every insight here comes from a specific bug, incident, or near-miss.


Failure Recovery: Slow Is Fine, Silent Is Not

Data pipelines eventually encounter every kind of failure. LLM APIs return 429. Source servers go down. PDFs turn out malformed halfway through. Networks hiccup.

ChatPD’s core logic is simple: distinguish “retrying might help” from “retrying won’t change anything.”

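use std::sync::atomic::{AtomicBool, Ordering};

// Returns None when a retry is worthwhile, Some(status) when the outcome is final.
// The abort flag is passed in explicitly instead of being read from outer scope.
fn classify_error(err: &PipelineError, abort_flag: &AtomicBool) -> Option<ProcessingStatus> {
    match err {
        // Transient: retry might succeed
        PipelineError::Http(429) | PipelineError::Timeout |
        PipelineError::ConnectionReset => None,

        // Terminal: retry is useless → record and skip permanently
        PipelineError::Http(404) =>
            Some(ProcessingStatus::SourceUnavailable),
        PipelineError::ParseError =>
            Some(ProcessingStatus::NoContent),

        // Auth failure or quota exhausted → global abort, don't burn remaining budget
        PipelineError::Http(401) | PipelineError::QuotaExhausted => {
            abort_flag.store(true, Ordering::Relaxed);
            Some(ProcessingStatus::Aborted)
        }

        // Anything unclassified (assumption): treat as transient, bounded by the retry cap
        _ => None,
    }
}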

This classification came from a specific failure: an early version retried everything 3 times, wasting quota repeatedly hitting 404s that would never succeed.

Same lesson, different form: checkpointing doesn’t need a complex system. ChatPD’s resume logic is under ten lines—query completed paper IDs from the database at startup, skip them in the main loop. Combined with idempotent writes (ON CONFLICT DO UPDATE), the pipeline produces identical results no matter how many times it runs.
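The resume logic can be sketched in a few lines. This is a minimal illustration, not ChatPD’s actual code: ResultStore is a hypothetical in-memory stand-in for the database, and run_pipeline and the process callback are names invented for the example. The point is the combination of "skip what is already done" with a write that is safe to repeat.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the results table (hypothetical; the real store is a database).
#[derive(Default)]
pub struct ResultStore {
    rows: HashMap<String, String>,
}

impl ResultStore {
    /// Idempotent write: like ON CONFLICT DO UPDATE, writing the same key twice
    /// leaves exactly one row holding the latest value.
    pub fn upsert(&mut self, paper_id: &str, result: &str) {
        self.rows.insert(paper_id.to_string(), result.to_string());
    }

    /// The startup query: which paper IDs already have a result?
    pub fn completed_ids(&self) -> HashSet<String> {
        self.rows.keys().cloned().collect()
    }

    pub fn row_count(&self) -> usize {
        self.rows.len()
    }
}

/// Process every ID that is not already done; returns how many were processed.
pub fn run_pipeline(
    store: &mut ResultStore,
    all_ids: &[&str],
    process: impl Fn(&str) -> String,
) -> usize {
    let done = store.completed_ids();
    let mut processed = 0;
    for &id in all_ids {
        if done.contains(id) {
            continue; // finished in an earlier run
        }
        let result = process(id);
        store.upsert(id, &result);
        processed += 1;
    }
    processed
}
```

Running the pipeline a second time over the same IDs processes nothing and leaves the row count unchanged, which is the property that makes a crash at hour 14 cheap.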

Recovery Levels

Different projects need different recovery sophistication. Building enough of these taught me to match the level to the project rather than over-engineering:

  • Level 1 — Skip existing output. if os.path.exists(out): return. One line.
  • Level 2 — Track processed IDs. multi-agent-topo lives here—check which instance JSON outputs exist at startup.
  • Level 3 — Structured checkpoints. ChatPD’s processing_status table—knows not just whether something finished, but which stage and what outcome.
  • Level 4 — Auto-recovery strategy. Escalate by failure type: transient → retry, persistent → rollback checkpoint, non-critical → log and skip, critical → pause.
  • Level 5 — Survive process death. Checkpoints in DB, work items with leases, graceful shutdown on SIGTERM.

ChatPD runs at Level 4. multi-agent-topo and s2s at Level 2. No project has needed Level 5 yet.
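The Level 4 escalation rule reads naturally as a small decision function. The sketch below is an assumption-laden illustration: FailureKind, RecoveryAction, and the retry budget are hypothetical names, and treating a transient failure that exhausts its retries as persistent is my reading, not something the list above specifies.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureKind {
    Transient,
    Persistent,
    NonCritical,
    Critical,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    Retry,
    RollbackToCheckpoint,
    LogAndSkip,
    Pause,
}

/// Map a failure to an action: transient → retry, persistent → rollback,
/// non-critical → log and skip, critical → pause.
pub fn recovery_action(kind: FailureKind, attempts: u32, max_retries: u32) -> RecoveryAction {
    match kind {
        FailureKind::Transient if attempts < max_retries => RecoveryAction::Retry,
        // Assumption: a transient failure that outlives its retry budget
        // is escalated and handled like a persistent one.
        FailureKind::Transient | FailureKind::Persistent => RecoveryAction::RollbackToCheckpoint,
        FailureKind::NonCritical => RecoveryAction::LogAndSkip,
        FailureKind::Critical => RecoveryAction::Pause,
    }
}
```

Keeping the policy in one exhaustive match means a new failure kind forces a compile error until someone decides what to do with it.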


Efficiency: Find the Bottleneck First

CPU-side bottlenecks usually fall into two types: compute-bound or memory-bandwidth-bound. Check IPC (instructions per cycle) with perf stat. As a rule of thumb, IPC > 2 with low cache misses → compute-bound, optimize the math; IPC < 1 with high cache misses → bandwidth-bound, reorganize the data. Optimizing the wrong one adds overhead with no gain.

Two simple laws prevent most guesswork on concurrency:

Amdahl’s Law: 10% serial portion → infinite cores give 10× speedup at best. Shrinking the serial portion matters more than adding cores.

Little’s Law: concurrency = throughput × latency. 2-second LLM calls, 50 req/s target → at least 100 concurrent workers. Derive max_workers from the formula, not intuition.
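Both laws are one-line formulas, so it is cheap to compute them instead of guessing. A minimal sketch follows; the function names are mine, and the numbers in the usage note are the ones quoted above.

```rust
/// Amdahl's Law: speedup on `cores` workers when `serial_fraction`
/// of the work cannot be parallelized.
pub fn amdahl_speedup(serial_fraction: f64, cores: f64) -> f64 {
    1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)
}

/// Little's Law: in-flight requests needed to sustain `throughput_per_sec`
/// when each request takes `latency_secs`.
pub fn required_concurrency(throughput_per_sec: f64, latency_secs: f64) -> usize {
    (throughput_per_sec * latency_secs).ceil() as usize
}
```

With a 10% serial portion, 16 cores give 1 / (0.1 + 0.9 / 16) = 6.4×, and the limit as cores grow is 1 / 0.1 = 10×. For 2-second calls at 50 req/s, required_concurrency(50.0, 2.0) gives 100 workers.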

Bounded Channel Pipelines

ChatPD chains four stages through three bounded mpsc channels. The key design: queues have caps. When DB writes slow down, upstream channels fill and backpressure propagates naturally—no unbounded memory growth. Each stage’s concurrency is tuned independently because their bottlenecks differ: Fetch is I/O-bound, LLM calls are latency-bound, DB writes are disk-bound.
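The shape of a four-stage, three-channel pipeline can be sketched with the standard library. This is an illustration under assumptions: the source does not say which channel implementation ChatPD uses, so std::sync::mpsc::sync_channel stands in for it, each stage runs on a single thread, and the stage bodies are placeholders.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Push ids through source → fetch → extract → write over three bounded channels.
/// `capacity` is the cap on each queue.
pub fn run_bounded_pipeline(ids: Vec<u32>, capacity: usize) -> Vec<String> {
    let (id_tx, id_rx) = sync_channel::<u32>(capacity);
    let (doc_tx, doc_rx) = sync_channel::<String>(capacity);
    let (record_tx, record_rx) = sync_channel::<String>(capacity);

    // Stage 1: emit work items. `send` blocks while the queue is full,
    // which is how backpressure reaches the front of the pipeline.
    let source = thread::spawn(move || {
        for id in ids {
            if id_tx.send(id).is_err() {
                break; // downstream hung up
            }
        }
    });

    // Stage 2: placeholder for fetching a document.
    let fetch = thread::spawn(move || {
        for id in id_rx {
            if doc_tx.send(format!("doc-{}", id)).is_err() {
                break;
            }
        }
    });

    // Stage 3: placeholder for the LLM extraction call.
    let extract = thread::spawn(move || {
        for doc in doc_rx {
            if record_tx.send(format!("{}:extracted", doc)).is_err() {
                break;
            }
        }
    });

    // Stage 4: placeholder for the DB writer, run on the caller's thread.
    // The loop ends once every upstream sender has been dropped.
    let written: Vec<String> = record_rx.into_iter().collect();

    source.join().unwrap();
    fetch.join().unwrap();
    extract.join().unwrap();
    written
}
```

If the writer slows down, the last queue fills first, then the one before it, and eventually the source blocks on send. Memory use is bounded by three times the capacity regardless of input size.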

Adaptive Concurrency

Starting at max concurrency gets you rate-limited immediately. ChatPD’s Fetch stage begins at half the target and adds a slot every 32 successes. On 429, it ramps down and waits 90 seconds between rounds. The pipeline finds its own stable operating point rather than relying on a human-picked “safe” concurrency number that turns out to be either too conservative or too aggressive.
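A controller like this fits in a small struct. The sketch below follows the numbers in the paragraph above (start at half the target, one extra slot per 32 successes, 90-second wait), but the type and method names are hypothetical, and halving on a 429 is an assumption: the text only says the limit ramps down.

```rust
use std::time::Duration;

pub struct AdaptiveLimit {
    current: usize,
    target: usize,
    successes_since_change: u32,
}

impl AdaptiveLimit {
    const SUCCESSES_PER_SLOT: u32 = 32;
    const COOLDOWN: Duration = Duration::from_secs(90);

    /// Start at half the target, never below one slot.
    pub fn new(target: usize) -> Self {
        let target = target.max(1);
        Self {
            current: (target / 2).max(1),
            target,
            successes_since_change: 0,
        }
    }

    pub fn current(&self) -> usize {
        self.current
    }

    /// Every 32nd success adds one slot, capped at the target.
    pub fn on_success(&mut self) {
        self.successes_since_change += 1;
        if self.successes_since_change >= Self::SUCCESSES_PER_SLOT {
            self.successes_since_change = 0;
            self.current = (self.current + 1).min(self.target);
        }
    }

    /// On HTTP 429: shrink the limit (halving is an assumption) and
    /// return how long the caller should wait before the next round.
    pub fn on_rate_limited(&mut self) -> Duration {
        self.successes_since_change = 0;
        self.current = (self.current / 2).max(1);
        Self::COOLDOWN
    }
}
```

The struct only tracks the limit. Enforcing it, for example with a semaphore sized to current(), is left to the caller.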

Two Easy-to-Miss Details

Reuse containers, don’t rebuild. multi-agent-topo’s three experimental modes share one Docker container per instance, reset with git reset --hard base_commit. One build takes minutes; 500 instances without reuse wastes hours.

Clean up zombie containers before starting. Added after an hour-long debugging session:

docker ps -a --filter "name=sweb.eval" --format '{{.ID}}' | xargs -r docker rm -f

Data Integrity: The Scariest Bug Is the Silent One

The pipeline finishes. Everything looks fine. But 3% of records are missing fields, or a foreign key points to a nonexistent paper. These bugs don’t surface immediately—they corrupt every downstream analysis built on top.

Health Audits

ChatPD’s audit_health() cross-checks every pipeline run with four SQL queries: orphan responses, request gaps, mismatched IDs, and papers stuck in PENDING or IN_PROGRESS. A reconcile_month_against_metadata() compares database records against arXiv metadata line by line: “processed 10,000 papers but metadata says there should be 10,023.” Where did those 23 go?

These checks went in after we discovered that one month’s papers had lost 200 records to a silent fetch-stage failure: the source returned an incomplete list with no error. The health audit now catches this at pipeline completion.
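The reconciliation step is essentially a set difference in both directions. A minimal sketch, assuming the expected IDs come from the metadata and the processed IDs from the database; ReconcileReport and reconcile are hypothetical names, not ChatPD’s actual functions.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
pub struct ReconcileReport {
    /// In the metadata but absent from the database.
    pub missing: Vec<String>,
    /// In the database but absent from the metadata.
    pub unexpected: Vec<String>,
}

/// Compare the IDs the metadata says should exist with the IDs actually stored.
pub fn reconcile<'a>(expected: &[&'a str], processed: &[&'a str]) -> ReconcileReport {
    let expected: BTreeSet<&str> = expected.iter().copied().collect();
    let processed: BTreeSet<&str> = processed.iter().copied().collect();
    ReconcileReport {
        missing: expected
            .difference(&processed)
            .map(|id| id.to_string())
            .collect(),
        unexpected: processed
            .difference(&expected)
            .map(|id| id.to_string())
            .collect(),
    }
}
```

Reporting the IDs themselves, not only the counts, is what turns "23 are missing" into something you can investigate.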

Golden Dataset Regression Tests

Maintain a small set of manually verified extraction results. Every pipeline code change reprocesses this batch and diffs against the golden dataset. Some diffs are expected (new fields). Some are bug fixes (previous parsing was wrong). Some make you stop: “why did 3% of records suddenly change?”
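A golden-dataset check can be as simple as a keyed diff. The sketch below assumes each record is reduced to a string keyed by paper ID; GoldenDiff and diff_against_golden are hypothetical names for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default, PartialEq, Eq)]
pub struct GoldenDiff {
    /// Present in both, but the extracted value differs.
    pub changed: Vec<String>,
    /// In the golden set, absent from the current output.
    pub missing: Vec<String>,
    /// In the current output, absent from the golden set.
    pub added: Vec<String>,
}

/// Diff the current extraction output against manually verified results.
pub fn diff_against_golden<'a>(
    golden: &BTreeMap<&'a str, &'a str>,
    current: &BTreeMap<&'a str, &'a str>,
) -> GoldenDiff {
    let mut diff = GoldenDiff::default();
    for (id, expected) in golden {
        match current.get(id) {
            Some(actual) if actual == expected => {} // unchanged
            Some(_) => diff.changed.push(id.to_string()),
            None => diff.missing.push(id.to_string()),
        }
    }
    for id in current.keys() {
        if !golden.contains_key(id) {
            diff.added.push(id.to_string());
        }
    }
    diff
}
```

Separating changed, missing, and added keeps expected diffs (new fields, new records) apart from the ones that should stop a merge.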

The Schema Migration Lesson

ChatPD’s database schema went through five major versions. The first flattened everything into one table, implicitly assuming “one paper → one dataset at most.” Later discoveries—papers reference multiple datasets, datasets have citation relationships—broke the initial abstraction. Changing the schema meant changing all downstream read code, the golden dataset, and the health audit SQL.

The lesson came from the cost: data model abstractions deserve the most design time because changing them has the highest cost in the entire system.


Observability: It’s Running. Now What?

Canary Pre-Checks

A full run takes days. A canary takes ten minutes. ChatPD samples equal numbers of papers from four major categories and runs the complete pipeline on them before committing to a full run. It returns Go only when every sample runs to completion, no data gaps exist, and at least one sample succeeds.
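The Go decision is three conditions checked in order. In this sketch the type names are hypothetical, and "complete" is read as "reached a terminal state, success or failure", which is my interpretation of the rule and not something the source spells out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SampleOutcome {
    Succeeded,
    /// Finished with a terminal, recorded failure.
    TerminalFailure,
    /// Still pending or in progress when the canary ended.
    Incomplete,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CanaryVerdict {
    Go,
    NoGo(&'static str),
}

/// Go only if every sample finished, no data gaps exist, and at least one succeeded.
pub fn canary_verdict(outcomes: &[SampleOutcome], data_gaps: usize) -> CanaryVerdict {
    if outcomes.iter().any(|o| *o == SampleOutcome::Incomplete) {
        return CanaryVerdict::NoGo("some samples never reached a terminal state");
    }
    if data_gaps > 0 {
        return CanaryVerdict::NoGo("health audit found data gaps");
    }
    if !outcomes.iter().any(|o| *o == SampleOutcome::Succeeded) {
        return CanaryVerdict::NoGo("no sample succeeded");
    }
    CanaryVerdict::Go
}
```

An empty sample list yields NoGo, because nothing succeeded. Returning a reason with NoGo saves a trip to the logs.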

This mechanism exists because of an incident: an 8-hour full run discovered that a prompt template formatting error had corrupted 80% of records. Eight hours of compute and API cost wasted. The canary went in immediately after.

The pattern generalizes: before a full run, test one or two units to measure per-unit time and extrapolate to full scale.

The Rate Limiter “Death Spiral”

SkillFab’s rate limiter distinguishes five dimensions, each independently limited. One implementation detail: rejected requests don’t record timestamps.

The conventional approach records every request timestamp in a sliding window. But if a rate-limited user keeps retrying, those retries refresh the window, locking them out permanently. Not recording rejections avoids this death spiral. This bug was only discoverable by observing real traffic—test environments never generate authentic rate-limit retry patterns.
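The difference comes down to one branch. Here is a minimal single-key sliding-window limiter that records only accepted requests. It is a sketch, not SkillFab’s implementation: the names are hypothetical, and the clock is passed in as a parameter so the behavior can be tested without sleeping.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

pub struct SlidingWindowLimiter {
    window: Duration,
    max_requests: usize,
    /// Timestamps of accepted requests only, oldest first.
    accepted: VecDeque<Instant>,
}

impl SlidingWindowLimiter {
    pub fn new(max_requests: usize, window: Duration) -> Self {
        Self {
            window,
            max_requests,
            accepted: VecDeque::new(),
        }
    }

    /// Returns true if the request at time `now` is allowed.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Drop timestamps that have aged out of the window.
        while let Some(&oldest) = self.accepted.front() {
            if now.duration_since(oldest) >= self.window {
                self.accepted.pop_front();
            } else {
                break;
            }
        }
        if self.accepted.len() < self.max_requests {
            self.accepted.push_back(now);
            true
        } else {
            // Rejected: deliberately not recorded, so retries while
            // limited cannot extend the lockout.
            false
        }
    }
}
```

With a limit of 2 per 10 seconds, requests at t=0 and t=1 pass, retries at t=5 and t=9 are rejected, and a request at t=10 passes because the t=0 entry has expired. Had the rejected retries been recorded, the window would still be full at t=10.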


System Abstraction: You Won’t Get It Right the First Time

Mechanism vs. Policy

ChatPD’s extraction pipeline has a generic LLM extraction module. It takes a prompt template and a document, returns structured JSON. It knows nothing about datasets, papers, or citations.

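// Mechanism only: render the prompt, call the model, validate the JSON.
// The client is passed in so tests can substitute a mock
// (LlmClient is a placeholder type name).
async fn extract_with_llm(
    llm_client: &LlmClient,
    doc: &Document,
    prompt_template: &PromptTemplate,
    output_schema: &JsonSchema,
) -> Result<serde_json::Value> {
    let prompt = prompt_template.render(doc);
    let response = llm_client.complete(&prompt).await?;
    // Parse and validate against the caller-supplied schema
    Ok(parse_json(&response, output_schema)?)
}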

When we later needed to extract research methods (not datasets), we wrote one new prompt template and output schema. Zero changes to the mechanism code. When testing, the mechanism can be tested with fake templates; the policy can be tested with a mock LLM. Completely decoupled.

The boundary: mechanism handles “how to call the LLM.” Policy handles “what to ask it.”
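The testing claim is easier to see with the LLM behind a trait. This is a simplified synchronous sketch using only the standard library. The trait, the {doc} placeholder syntax, and EchoLlm are all hypothetical and chosen for illustration; the real module is async and returns JSON.

```rust
/// The only thing the mechanism knows about the model.
pub trait LlmClient {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Policy: what to ask. The mechanism never inspects the text.
pub struct PromptTemplate {
    pub text: &'static str,
}

impl PromptTemplate {
    /// Substitute the document into the template's {doc} placeholder.
    pub fn render(&self, doc: &str) -> String {
        self.text.replace("{doc}", doc)
    }
}

/// Mechanism: how to call the LLM. Works for any template and any client.
pub fn extract(
    client: &dyn LlmClient,
    template: &PromptTemplate,
    doc: &str,
) -> Result<String, String> {
    client.complete(&template.render(doc))
}

/// Mock client for tests: echoes the prompt it was given.
pub struct EchoLlm;

impl LlmClient for EchoLlm {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("echo: {}", prompt))
    }
}
```

Swapping "datasets" for "research methods" means constructing a different PromptTemplate. The extract function and the mock stay untouched.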

s2s’s filesystem protocol is the same idea concretized. Claude Code inside the container communicates with the host exclusively through the filesystem—JSON written to /workspace/tasks/, a watcher detects new files and acts. Zero network dependency, trivially debuggable, fully decoupled processes.
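The host side of such a protocol can be a polling loop over a directory. The sketch below shows one polling pass. It is not s2s’s actual watcher: the function name is hypothetical, and polling with read_dir is an assumption, since the source does not say how the watcher detects files.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// One polling pass: return the .json files in `dir` that have not been
/// seen before, sorted by path, and remember them in `seen`.
pub fn poll_new_tasks(dir: &Path, seen: &mut HashSet<PathBuf>) -> io::Result<Vec<PathBuf>> {
    let mut fresh = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_json = path.extension().and_then(|ext| ext.to_str()) == Some("json");
        // `insert` returns false for paths already handled in an earlier pass.
        if is_json && seen.insert(path.clone()) {
            fresh.push(path);
        }
    }
    fresh.sort();
    Ok(fresh)
}
```

One caveat with any file-based protocol: a poller can observe a file that is only half written. A common mitigation is for the writer to write under a temporary name and rename it into place once complete.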

An Abstraction That Didn’t Work

multi-agent-topo’s three experimental modes share one Docker container, resetting state with git reset --hard base_commit. The assumption: all three modes have identical environments, and git reset always produces a clean state.

In practice: untracked and ignored files that git reset --hard leaves in place, git reset failures from stale lock files, zombie processes holding file descriptors. We patched it with git clean -fd and process cleanup (git clean -fd still skips ignored files; removing those needs -x), but the fundamental problem remained: state reset isn’t fully guaranteed by git reset, and this abstraction promised something it couldn’t deliver.

When to Abstract

SkillFab’s early route handlers inlined database queries directly. The same query logic appeared in three different routes—fix one, miss the other two. This wasn’t “abstraction done wrong”—it was no abstraction at all.

The later three-layer architecture (Route → TS Service → Rust Native) fixed this. But doing three layers from day one for a simple CRUD operation would have been equally stupid.

My practice now: extract after you see the repetition. First occurrence: inline. Second: tolerate. Third: extract. Fourth: congratulations, you have a battle-tested abstraction.


Specific Mistakes I’ve Made

Everything above describes things that eventually worked. These didn’t. Each has a concrete git commit behind it.

Type-safety shortcuts became debt. One as any takes 2 seconds to write. Fixing it can take 20 minutes—understanding context, defining correct types, confirming edge cases. After accumulating dozens, cleanup isn’t linear.

New API goes in, old API doesn’t come out. ChatPD’s persistence layer went through three API evolutions. Each created a new interface without deleting the old one. SkillFab had the same pattern—”remove dead code” commits appeared five or six times. Nobody proactively says “this can be deleted.”

Cross-language naming conventions weren’t settled upfront. SkillFab’s TS uses camelCase, Rust uses snake_case. At the napi boundary, neither was standardized. Settling it on day one costs nothing; deferring three months means touching every route.

Config defaults that kept changing. ChatPD’s qwen db default path changed six times across six commits. Config options were added one at a time, each with its own read logic, with no documented precedence between them.

Refactoring that broke backward compatibility. Changing a discriminated union’s structure without running “find all references” first. Downstream switch/case matching broke silently.


What I Still Don’t Have Good Answers For

Agent evaluation infrastructure. Evaluating a model = test set + metrics. Evaluating an agent = trajectories + intermediate decisions + environment stability. We’re still evaluating agents with model-era methods—checking final accuracy while ignoring where the intermediate steps went wrong. Anthropic’s Harness Design for Long-Running Apps describes using an independent Evaluator agent with Playwright to test artifacts, which is more nuanced than pass/fail, but there’s still distance from knowing why an agent made a wrong turn mid-execution.

Dataset versioning. ChatPD’s output evolves over time. Downstream uses v1, upstream is at v3—how do you know whether to rerun? Currently solved with README version numbers. It’s not enough.

The “is this even worth it” moment. ChatPD took months from first line to reliable output. Somewhere in the middle there’s always a moment where you wonder whether this was a mistake. Anthropic’s harness design post has an observation that applies here: harness components encode assumptions about model limitations that go stale as models improve. Every few months, re-examine which are load-bearing and which are dead weight. Same applies to infrastructure projects.


Anthropic posts that shaped this thinking: Building Effective Agents, Effective Context Engineering for AI Agents, Harness Design for Long-Running Application Development.
