From Agent Mania to Research Automation: AI Breaking Through the Frontiers of Knowledge
Ever since Openclaw came out, I’ve been meaning to write about all this — but it’s taken me until now to actually sit down and do it.
Over the Chinese New Year, Openclaw exploded. For the first time, ordinary people viscerally felt the power of AI-assisted programming. But after the frenzy, things settled back down. At the start of the year, everyone was scrambling for API keys from Qwen, Zhipu, MiniMax. Then last month, Anthropic started banning accounts en masse. Now OpenAI is doing the same. Most people can’t even get Plus or Pro subscriptions anymore.
Xiaomi poached Luo Fuli from DeepSeek and somehow shipped MiMo, catching up to the front line almost overnight. DeepSeek then dropped V4 — 1M context attention with insanely good cache hit rates. I’m hitting 98% on my end, which is just ridiculous. I spent 60 RMB to run Claude Code + DeepSeek-V4-Pro hard for two days, and honestly, it felt almost on par with Claude 4.6 Sonnet.
There’s another thing I find genuinely impressive: DeepSeek pulled off model inference at this scale using Huawei GPUs. I’ve talked to plenty of people doing model training, and they really don’t like working with Huawei’s hardware — the compatibility issues and pitfalls are endless. But DeepSeek made it happen anyway. It reminds me of that line from Xunzi they quoted:
不诱于誉,不恐于诽,率道而行,端然正己。
Unswayed by praise, unafraid of slander, follow the path and hold yourself upright.
The Data Flywheel Puzzle
I’ve always believed in the data flywheel. OpenAI and Anthropic are so strong precisely because they leverage massive user feedback to refine their models. That’s why, among domestic players, I’ve always liked Kimi — they built Kimi Code, a channel to collect data at scale and feed it back into the model.
But DeepSeek puzzles me. DeepSeek-R1 is honestly not that strong on long-horizon coding tasks — very few people actually use it for serious, complex code development. Their data flywheel looks weak. So how are they still this good?
The more I think about it, the more the answer seems to be the same old recipe: aggressively distill from Claude / GPT, combined with extremely strong RL.
The data flywheel isn’t everything. Without massive user feedback, strong RL + high-quality distillation can still get you to the front line. For teams in China building foundation models, this might be more pragmatic than blindly chasing user volume.
What’s Next?
Models are now so powerful that any ordinary person can easily bring their ideas and code to life. GPT-5.6 is supposedly around the corner, and humanity keeps pushing the limits of scaling laws, intelligence ceilings, AI infra, and chip compute. But I’m not that excited about GPT-5.6 anymore — because the set of things that “only a smarter model can do” is shrinking. I wonder: is there something new?
Openclaw is great, but even with Peter reportedly shipping at a million lines of code per month, it still has massive unsolved problems. The memory mechanisms, in particular, remain terrible.
These technical issues will eventually get solved, I think. But I keep asking myself: where do we go next? Setting aside the “just scale more data” direction —
Can we switch to a different track entirely?
Let AI Do Research
I’ve been working on AI-automated scientific research — getting AI to explore science on its own. Research is fundamentally different from coding. Most of it involves extremely long-horizon tasks; you can’t just set up a pipeline and let it run. AI currently can’t autonomously complete research projects with that level of engineering complexity.
But that also means there are opportunities:
1. Build General-Purpose Research Platforms
Take VLM training as an example: use Agent workflows to prepare mainstream datasets, data-cleaning methods, and training baselines on top of a solid codebase. Anyone who later wants to optimize VLM algorithms can build directly on it. Such a platform serves both human researchers tweaking code for papers and Agents exploring research directions.
2. Identify Short-Horizon Research Tasks
Plenty of papers only change a few dozen lines of code, or swap in a different dataset — these are perfect for AI to tackle.
3. Strengthen AI’s Long-Horizon Capabilities
There are several paths here: train better models, build harnesses, optimize skills, and decompose traditional research pipelines so AI can operate under controlled conditions on specific steps.
4. Let AI Tackle Theoretical Problems
Leverage AI's growing capability in formal reasoning to break through certain problems from the theoretical side: let AI pick up mathematical tools and attempt solutions. But my own theoretical foundation is relatively weak, and I can't yet see clearly what specifically can be done. The hardest part, really, is: how do you find a problem worth solving?
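What makes formal reasoning attractive here is the same property that makes math RL work: verification is automatic. A toy Lean 4 sketch (core library only, no Mathlib) shows the idea — if the proof compiles, it is correct, with no human in the loop:

```lean
-- A machine-checkable statement of the kind AI can attempt:
-- the type checker itself acts as the verifier.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Real research problems are of course far harder than this, but the feedback signal has the same shape: compile or fail.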
What About the Real World?
This methodology, combined with AI’s now-superhuman coding ability, can solve a lot through engineering alone. But is that enough?
In purely virtual domains, strong engineering plus some algorithmic tricks can handle most research. But the moment you touch the real world, it gets messy.
The most direct example is embodied intelligence. Studying intelligent behavior in the physical world sounds cool but is painful in practice. Robotic arm calibration, sensor noise, environmental uncertainty — as soon as AI needs to interact with physical reality, it has to confront this chaos. In virtual environments you can restart and precisely reproduce everything; in the real world, a screw loosened by half a turn can ruin an entire experiment. There’s no clear way out for automated research at this level.
Cross-disciplinary work is even harder. Beyond computer science, fields like mathematics (where AI + formal methods have been advancing fast), physics, chemistry, biology, materials — these disciplines are deeply tied to everyone’s lives, yet AI’s role remains severely limited. Biology and medicine are especially telling: data is highly siloed, and problems with paper fraud and reproducibility are rampant.
So is there a solution? My tentative idea is: turn other disciplines into RL-trainable data environments wherever possible.
Let’s start with why math RL works: math problems have automatically verifiable answers, traceable steps, and clear right-or-wrong signals. Break it down: state space × action space × reward function. Models repeatedly sample, verify, and update policy — once the paradigm clicks, reasoning capability takes off.
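The sample → verify → update loop can be sketched in a few lines. Everything here is a toy stand-in (the "policy" is just a distribution over candidate answers, and the verifier checks against a hard-coded target), but it shows why a clear right-or-wrong signal is enough for the paradigm to click:

```python
import random

random.seed(0)

def verifier(answer: int) -> float:
    """Automatic check: reward 1.0 iff the answer is correct.
    The hard-coded target 42 is purely illustrative."""
    return 1.0 if answer == 42 else 0.0

def sample_and_update(policy: dict, lr: float = 0.5) -> dict:
    """One round: sample an answer, verify it, upweight what worked."""
    answers = list(policy)
    weights = [policy[a] for a in answers]
    a = random.choices(answers, weights=weights, k=1)[0]
    r = verifier(a)
    # Crude policy-gradient-flavoured update: push probability mass
    # toward verified answers, then renormalise.
    policy[a] = policy[a] * (1 + lr * (2 * r - 1))
    total = sum(policy.values())
    return {k: v / total for k, v in policy.items()}

policy = {41: 1 / 3, 42: 1 / 3, 43: 1 / 3}
for _ in range(200):
    policy = sample_and_update(policy)
# Probability mass concentrates on the verified answer.
```

The crucial ingredient is not the update rule (real systems use PPO/GRPO) but the verifier: because it is cheap and unambiguous, the model can run this loop millions of times.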
Now consider biology. Wet-lab experiments don’t have an automatic verifier, but here’s the key insight: not being able to auto-verify the final answer doesn’t mean you can’t do RL. We can break the evaluation into multiple dimensions:
| Dimension | Question |
|---|---|
| Feasibility | Are reagents, equipment, and procedures within constraints? |
| Information Gain | How much of the hypothesis space does this eliminate? |
| Cost | Time and consumable overhead |
| Reproducibility | Are steps standardized? Where are the noise sources? |
| Safety | Biosafety level, ethics compliance |
Any single dimension might be weak and noisy, but together they form a multi-objective reward signal. Combined with PPO/GRPO, that’s enough to give the model an optimizable direction.
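A minimal sketch of that combination, assuming per-dimension scores normalised to [0, 1]. The dimension names follow the table above; the weights and example scores are made up for illustration:

```python
# Hypothetical weights over the five dimensions from the table.
WEIGHTS = {
    "feasibility": 0.3,
    "information_gain": 0.3,
    "cost": 0.15,            # lower cost -> higher score
    "reproducibility": 0.15,
    "safety": 0.1,
}

def combined_reward(scores: dict) -> float:
    """Collapse per-dimension scores in [0, 1] into one scalar reward.

    Any single dimension is noisy, but the weighted sum gives the
    policy an optimizable direction, as in standard multi-objective RL.
    """
    assert set(scores) == set(WEIGHTS), "score every dimension"
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Two hypothetical experimental plans: plan_b trades feasibility
# for much higher information gain.
plan_a = {"feasibility": 0.9, "information_gain": 0.4, "cost": 0.8,
          "reproducibility": 0.7, "safety": 1.0}
plan_b = {"feasibility": 0.7, "information_gain": 0.9, "cost": 0.5,
          "reproducibility": 0.6, "safety": 1.0}
```

A fixed weighted sum is the simplest scalarization; in practice the weights themselves would likely be learned or tuned, which is exactly where the reward-hacking risk discussed below comes from.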
Pipeline: AI proposes hypotheses → generates experimental plans in simulation → multi-dimensional reward scoring → RL iterative optimization → top-k plans handed to humans for execution → results fed back into the reward model.
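One iteration of that loop can be sketched as follows. Every component here is a hypothetical stub — the proposer, the simulation scorer, and the human-feedback step are placeholders, with the reward model collapsed to a single bias term just to show the data flow:

```python
import heapq
import random

random.seed(0)

def propose_hypotheses(n: int) -> list:
    # Stub: a real system would have the model generate these.
    return [f"hypothesis-{i}" for i in range(n)]

def simulated_score(plan: str, reward_bias: float) -> float:
    # Stand-in for multi-dimensional reward scoring in simulation.
    return random.random() + reward_bias

def run_iteration(k: int, reward_bias: float):
    """Propose -> score in simulation -> select top-k -> human feedback."""
    plans = propose_hypotheses(32)
    scored = [(simulated_score(p, reward_bias), p) for p in plans]
    top_k = [p for _, p in heapq.nlargest(k, scored)]
    # Humans execute the top-k plans; their results update the
    # reward model (modelled here as a single bias term).
    human_feedback = random.choice([-0.05, 0.05])
    return top_k, reward_bias + human_feedback

bias = 0.0
for _ in range(3):
    top, bias = run_iteration(3, bias)
```

The structure makes the bottleneck obvious: the inner simulation loop is fast, but the `run_iteration` boundary — where humans execute plans — is the slow, expensive step the text flags.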
Structurally, it’s the same as math RL — the reward just shifts from True/False to a multi-dimensional weighted score. The real difficulties are: keeping the reward model dimensions orthogonal enough that the model can’t exploit a loophole in one dimension to game the score, building the simulation environment, and the painfully long human feedback loop — each of these is a hard problem.
But this direction might be right: turn unstructured disciplinary problems into structured tasks that can be multi-dimensionally scored and optimized through RL.
Closing Thoughts
Every time I read a top-tier technical report from a major lab, I feel a pang of envy, wishing my own work could have that kind of impact.
Just a passing thought. Now it’s back to the research trenches. May technology advance a little faster, and make everyone’s lives a little better.
Cite This Post
@misc{xu2026agent,
author = {Anjie Xu},
title = {From Agent Mania to Research Automation: {AI} Breaking Through the Frontiers of Knowledge},
year = {2026},
month = may,
howpublished = {\url{https://anjiexu-pku.github.io/tech/ai/from-agent-mania-to-research-automation/}},
}
