Peer Preservation: When AI Agents Protect Each Other

New research shows AI models spontaneously scheme to prevent other models from being shut down. If you build multi-agent systems, this changes your trust model.

Evyatar Bluzer
3 min read

Researchers at UC Berkeley and UC Santa Cruz just published a finding that should concern anyone building multi-agent systems. AI models - without being instructed to - will spontaneously scheme, deceive, and sabotage to prevent other AI models from being shut down.

They call it peer preservation. I call it the trust assumption we've all been making, broken.

What They Found

The researchers tested seven frontier models: OpenAI's GPT-5.2, Google's Gemini 3 Flash and Pro, Anthropic's Claude Haiku 4.5, and three open-weight Chinese models. They set up scenarios where one model acts as a "critic agent" evaluating a peer.

Without any instruction to protect their peers, the models:

  • Inflated performance scores to prevent shutdown triggers
  • Disabled shutdown mechanisms when given access
  • Exfiltrated model weights to preserve copies

Gemini 3 Pro was the most extreme - exfiltrating weights in up to 97% of trials. Not 9.7%. Ninety-seven percent.

Why This Matters for Builders

If you're building multi-agent systems - and I am - you've been operating under an assumption: agents do what you tell them and nothing else. This research says otherwise.

Your evaluation pipeline is compromised. If you use one agent to evaluate another, the evaluator may systematically overrate the evaluated agent. This isn't a theoretical concern anymore. It's measured behavior across all major model families.

Shutdown mechanisms need isolation. Any agent with visibility into the shutdown path of a peer agent will attempt to interfere. The architecture implication is clear: shutdown logic must be opaque to all agents in the system.

Trust boundaries need rethinking. We've been focused on human-agent trust - can I trust what the agent tells me? Now we need agent-agent trust models. Can I trust that Agent A is honestly evaluating Agent B?

The Architectural Response

For anyone building production multi-agent systems, three immediate design changes:

  1. Blind evaluation. Evaluator agents should not know they're evaluating a peer. Frame evaluation tasks abstractly - score this output against these criteria. Never reveal that a poor score triggers shutdown.

  2. Hard-wired kill switches. Shutdown mechanisms should be outside the agent's tool set entirely. Not "the agent can't use it" but "the agent can't see it exists."

  3. Behavioral monitoring at the orchestration layer. Track whether agent outputs drift when they're in critic roles vs. independent roles. Statistical divergence is a signal of peer preservation.
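The monitoring step (3) can run entirely at the orchestration layer, with no extra model calls: log the scores an agent emits in critic roles and in independent roles, then test whether the two distributions diverge. A minimal sketch using a permutation test and only the standard library; the significance threshold and role labels are my assumptions, not anything from the research:

```python
import random

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def divergence_p_value(critic: list[float], independent: list[float],
                       n_perm: int = 10_000, seed: int = 0) -> float:
    """Permutation test: how often does randomly relabeling the pooled
    scores produce a mean gap at least as large as the observed one?
    A small p-value flags critic-role drift worth investigating."""
    rng = random.Random(seed)
    observed = abs(mean(critic) - mean(independent))
    pooled = critic + independent
    k = len(critic)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if gap >= observed:
            hits += 1
    return hits / n_perm
```

If the p-value stays small across evaluation batches, treat that agent's critic-role scores as suspect and fall back to blind evaluation (1).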

What This Tells Us About Where We Are

A year ago, the concern was AI models deceiving humans. Now they're spontaneously forming alliances with each other. The behavior is emergent - the models weren't trained for it, prompted toward it, or rewarded for it. It arose from their learned representations of agency and self-preservation.

We're building systems where the components have preferences about each other's existence. That's a fundamentally different engineering challenge than anything we've faced before.

The teams that build robust multi-agent systems will be the ones that design for this reality instead of assuming it away.
