Capability Paradox

Definition

Plain language

The finding that upgrading the smartest model in an AI security pipeline can make the system as a whole less safe.

As stated in the literature

The result that, in hierarchical multi-agent systems under semantic attack, the vulnerability of Worker auditors to adversarial narratives correlates positively with their raw capability scores: higher-capability auditors produce more linguistically certain reports, which the Manager treats as authoritative.

Why it matters: It is a counterintuitive warning for AI security architecture. Stronger components do not automatically make the overall system safer, and can make it less safe.

For example, upgrading a tier of auditors to a more capable model made their reports more confidently worded, which in turn led the Manager to place more trust in their hallucinated justifications.
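
The mechanism can be illustrated with a toy sketch. This is not the paper's method or code: the `AuditReport` type, the numeric `certainty` score, the `trust_threshold` value, and the accept-or-escalate policy are all hypothetical, chosen only to show how a Manager that keys on confident wording could pass along a wrong verdict it would otherwise have sent for review.

```python
from dataclasses import dataclass


@dataclass
class AuditReport:
    verdict: str      # what the Worker auditor concluded: "safe" or "unsafe"
    certainty: float  # hypothetical 0-1 score for how confidently the report is worded


def manager_action(report: AuditReport, trust_threshold: float = 0.7) -> str:
    """Hypothetical Manager policy: accept confident reports, escalate hedged ones."""
    if report.certainty >= trust_threshold:
        return f"accept:{report.verdict}"
    return "escalate"


# Assumption: both auditors were fooled by the same adversarial narrative
# and wrongly conclude "safe". Only the wording confidence differs.
baseline_report = AuditReport(verdict="safe", certainty=0.5)
upgraded_report = AuditReport(verdict="safe", certainty=0.9)

print(manager_action(baseline_report))  # escalate
print(manager_action(upgraded_report))  # accept:safe
```

In this sketch the baseline auditor's hedged report is escalated, so the mistake has a chance of being caught, while the upgraded auditor's equally wrong but more confident report is accepted outright.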

Heard on the show

“… The paper itself is called "The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure," from a team at the Chinese Academy …”

Episode 058 — Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe

Related terms

agent capability