Harness Engineering: The Discipline That Quietly Won
LangChain jumped 25 ranks on TerminalBench by changing only the harness. The model never changed.
The biggest performance lever in AI agents is not the model. It is the harness.
Agent = Model + Harness
The term crystallized this month. OpenAI shipped its Agents SDK update on April 15 with a model-native harness - sandbox execution, filesystem tools, snapshot and rehydration - built from the same scaffolding that powers Codex. Martin Fowler published a detailed synthesis the same week. An awesome-harness-engineering repo appeared on GitHub. The concept now has a name, a community, and a discipline.
Harness means everything in an agent system that is not the model itself: the tools it can call, the guardrails that constrain it, the feedback loops that help it self-correct, and the observability layer that lets humans monitor behavior.
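To make that boundary concrete, here is a minimal sketch of an agent loop with the harness pulled out around the model call. Everything in it, `run_agent`, `log_step`, the shape of the action dict, is invented for illustration, not any particular SDK's API:

```python
import json

def log_step(step: int, action: dict, result: str) -> None:
    """Observability hook: in practice this feeds traces, not stdout."""
    print(f"[step {step}] {json.dumps(action)} -> {result[:80]}")

def run_agent(task: str, model, tools: dict, max_steps: int = 20):
    """Everything in this loop except the model(history) call is harness."""
    history = [{"role": "user", "content": task}]
    for step in range(max_steps):                        # throttle: bounded steps
        action = model(history)                          # the model itself
        if action["type"] == "final":
            return action["content"]
        if action["tool"] not in tools:                  # guardrail: refuse unknown tools
            history.append({"role": "system",
                            "content": f"unknown tool: {action['tool']}"})
            continue
        result = tools[action["tool"]](**action["args"]) # execution environment
        log_step(step, action, result)                   # observability layer
        history.append({"role": "tool", "content": result})  # feedback loop
    return None                                          # step budget exhausted
```

One line of that function is the model. The rest is harness, and the rest is where the engineering leverage lives.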
The LangChain Proof Point
LangChain's Deep Agents team took GPT-5.2-Codex from outside the top 30 at 52.8% to rank 5 at 66.5% on TerminalBench 2.0. The model did not change. They swapped the harness - self-verification loops, loop-detection middleware, better context engineering.
A 13.7-point jump from infrastructure alone. That result should change how every team allocates engineering effort.
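What does loop-detection middleware actually look like? LangChain's implementation is theirs, but as a hedged sketch of the pattern: track recent (tool, args) signatures and inject a corrective message when one repeats too often.

```python
from collections import deque

class LoopDetector:
    """Middleware sketch (not LangChain's code): flag an agent that keeps
    issuing the same action and steer it back before it burns the budget."""

    def __init__(self, window: int = 8, threshold: int = 3):
        self.recent = deque(maxlen=window)   # sliding window of recent actions
        self.threshold = threshold           # repeats before we intervene

    def check(self, tool: str, args: dict) -> str | None:
        """Return a corrective message to inject into context, or None."""
        signature = (tool, tuple(sorted(args.items())))
        self.recent.append(signature)
        if self.recent.count(signature) >= self.threshold:
            return (f"You have called {tool} with identical arguments "
                    f"{self.threshold} times. Step back and try a different approach.")
        return None
```

Self-verification follows the same shape: when the agent claims it is done, the harness reruns the tests and feeds any failures back into context instead of taking the claim at face value.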
Seven Components
The emerging taxonomy breaks a harness into seven categories (a code sketch of how they compose follows the list):
- Guides - feedforward controls that steer before the agent acts
- Sensors - feedback controls that observe after the agent acts and trigger self-correction
- Rails - hard constraints that prevent catastrophic actions
- Scaffolds - the execution environment, sandbox, filesystem access
- Exemplars - few-shot examples and demonstrations
- Mirrors - self-reflection and evaluation mechanisms
- Throttles - rate limiting, cost controls, token budgets
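None of this is a published API, but as an invented-for-illustration sketch, the seven categories could compose into a single configuration object like this:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """The seven categories as one (hypothetical) configuration object."""
    guides: list[str] = field(default_factory=list)        # feedforward: prompts, plans
    sensors: list[Callable] = field(default_factory=list)  # feedback: post-action checks
    rails: list[Callable] = field(default_factory=list)    # hard pre-action vetoes
    scaffold: str = "sandbox"                               # execution environment
    exemplars: list[dict] = field(default_factory=list)    # few-shot demonstrations
    mirrors: list[Callable] = field(default_factory=list)  # self-evaluation passes
    max_steps: int = 50                                     # throttle: step budget
    max_cost_usd: float = 5.0                               # throttle: cost ceiling

    def permit(self, action: dict) -> bool:
        """Rails run before an action; any veto blocks it outright."""
        return all(rail(action) for rail in self.rails)

    def observe(self, action: dict, result: str) -> list[str]:
        """Sensors run after an action; findings feed back into context."""
        return [msg for sensor in self.sensors
                if (msg := sensor(action, result))]
```

Guides and exemplars shape the context before the model acts; rails and throttles bound what it can do; sensors and mirrors close the loop afterward.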
Fowler frames it as three interlocking systems: context engineering, architectural constraints, and entropy management. Both taxonomies point at the same truth. The model is necessary but not sufficient.
What This Means for Builders
For months I have been building agent systems where the harness work dwarfs the prompt work. But until now there was no shared vocabulary for what we were actually doing.
Fowler identifies three human postures - humans outside, in, or on the loop. His argument: maintaining the harness rather than reviewing individual outputs is the only approach that scales with agent throughput. I agree. The CLAUDE.md files, AGENTS.md conventions, and skill configurations I work with daily are all harness engineering. We just did not have the name.
Prompt engineering got us started. Harness engineering is what gets us to production.