Harness Engineering: The Discipline That Quietly Won
LangChain jumped 25 ranks on TerminalBench by changing only the harness. The model never changed.
The biggest performance lever in AI agents is not the model. It is the harness.
Agent = Model + Harness
The term crystallized this month. OpenAI shipped its Agents SDK update on April 15 with a model-native harness - sandbox execution, filesystem tools, snapshot and rehydration - built from the same scaffolding that powers Codex. Martin Fowler published a detailed synthesis the same week. An awesome-harness-engineering repo appeared on GitHub. The concept now has a name, a community, and a discipline.
Harness means everything in an agent system that is not the model itself: the tools it can call, the guardrails that constrain it, the feedback loops that help it self-correct, and the observability layer that lets humans monitor behavior.
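To make that boundary concrete, here is a minimal sketch of an agent loop with the harness pulled out around the model call. Everything in it, `run_agent`, `log_step`, the shape of the action dict, is invented for illustration, not any particular SDK's API:

```python
import json

def log_step(step: int, action: dict, result: str) -> None:
    """Observability hook: in practice this feeds traces, not stdout."""
    print(f"[step {step}] {json.dumps(action)} -> {result[:80]}")

def run_agent(task: str, model, tools: dict, max_steps: int = 20):
    """Everything in this loop except the model(history) call is harness."""
    history = [{"role": "user", "content": task}]
    for step in range(max_steps):                        # throttle: bounded steps
        action = model(history)                          # the model itself
        if action["type"] == "final":
            return action["content"]
        if action["tool"] not in tools:                  # guardrail: refuse unknown tools
            history.append({"role": "system",
                            "content": f"unknown tool: {action['tool']}"})
            continue
        result = tools[action["tool"]](**action["args"]) # execution environment
        log_step(step, action, result)                   # observability layer
        history.append({"role": "tool", "content": result})  # feedback loop
    return None                                          # step budget exhausted
```

One line of that function is the model. The rest is harness, and the rest is where the engineering leverage lives.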
The LangChain Proof Point
LangChain's Deep Agents team took GPT-5.2-Codex from outside the top 30 at 52.8% to rank 5 at 66.5% on TerminalBench 2.0. The model did not change. They swapped the harness - self-verification loops, loop-detection middleware, better context engineering.
A 13.7-point jump from infrastructure alone. That result should change how every team allocates engineering effort.
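What does loop-detection middleware actually look like? LangChain's implementation is theirs, but as a hedged sketch of the pattern: track recent (tool, args) signatures and inject a corrective message when one repeats too often.

```python
from collections import deque

class LoopDetector:
    """Middleware sketch (not LangChain's code): flag an agent that keeps
    issuing the same action and steer it back before it burns the budget."""

    def __init__(self, window: int = 8, threshold: int = 3):
        self.recent = deque(maxlen=window)   # sliding window of recent actions
        self.threshold = threshold           # repeats before we intervene

    def check(self, tool: str, args: dict) -> str | None:
        """Return a corrective message to inject into context, or None."""
        signature = (tool, tuple(sorted(args.items())))
        self.recent.append(signature)
        if self.recent.count(signature) >= self.threshold:
            return (f"You have called {tool} with identical arguments "
                    f"{self.threshold} times. Step back and try a different approach.")
        return None
```

Self-verification follows the same shape: when the agent claims it is done, the harness reruns the tests and feeds any failures back into context instead of taking the claim at face value.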
Seven Components
The emerging taxonomy breaks a harness into seven categories (a code sketch of how they compose follows the list):
- Guides - feedforward controls that steer before the agent acts
- Sensors - feedback controls that observe after the agent acts and trigger self-correction
- Rails - hard constraints that prevent catastrophic actions
- Scaffolds - the execution environment, sandbox, filesystem access
- Exemplars - few-shot examples and demonstrations
- Mirrors - self-reflection and evaluation mechanisms
- Throttles - rate limiting, cost controls, token budgets
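None of this is a published API, but as an invented-for-illustration sketch, the seven categories could compose into a single configuration object like this:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """The seven categories as one (hypothetical) configuration object."""
    guides: list[str] = field(default_factory=list)        # feedforward: prompts, plans
    sensors: list[Callable] = field(default_factory=list)  # feedback: post-action checks
    rails: list[Callable] = field(default_factory=list)    # hard pre-action vetoes
    scaffold: str = "sandbox"                               # execution environment
    exemplars: list[dict] = field(default_factory=list)    # few-shot demonstrations
    mirrors: list[Callable] = field(default_factory=list)  # self-evaluation passes
    max_steps: int = 50                                     # throttle: step budget
    max_cost_usd: float = 5.0                               # throttle: cost ceiling

    def permit(self, action: dict) -> bool:
        """Rails run before an action; any veto blocks it outright."""
        return all(rail(action) for rail in self.rails)

    def observe(self, action: dict, result: str) -> list[str]:
        """Sensors run after an action; findings feed back into context."""
        return [msg for sensor in self.sensors
                if (msg := sensor(action, result))]
```

Guides and exemplars shape the context before the model acts; rails and throttles bound what it can do; sensors and mirrors close the loop afterward.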
Fowler frames it as three interlocking systems: context engineering, architectural constraints, and entropy management. Both taxonomies point at the same truth. The model is necessary but not sufficient.
What This Means for Builders
For months I have been building agent systems where the harness work dwarfs the prompt work. But until now there was no shared vocabulary for what we were actually doing.
Fowler identifies three human postures - humans outside, in, or on the loop. His argument: maintaining the harness rather than reviewing individual outputs is the only approach that scales with agent throughput. I agree. The CLAUDE.md files, AGENTS.md conventions, and skill configurations I work with daily are all harness engineering. We just did not have the name.
Prompt engineering got us started. Harness engineering is what gets us to production.