Glossary · Term

Capability Paradox

← all terms

Definition

The finding that upgrading the smartest model in an AI security pipeline can make the system as a whole less safe.

The result that in hierarchical multi-agent systems with semantic attacks, Worker auditor vulnerability to adversarial narratives correlates positively with raw capability scores (MMLU, GPQA), because higher-capability auditors produce more linguistically certain reports that the Manager treats as authoritative.

Mentioned in 1 episode

  1. 058
    Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe