WMDP · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A benchmark that tests whether AI models hold hazardous knowledge that could be misused to cause mass harm.

As stated in the literature

The Weapons of Mass Destruction Proxy benchmark, designed to probe LLMs for hazardous knowledge in the biology, chemistry, and cybersecurity domains.

Also called: W-M-D-P

Why it matters: It lets labs measure and target the specific knowledge they want to keep out of public models, supporting unlearning and safety evaluations.

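For example, WMDP might pose a multiple-choice question on the technical background behind synthesizing dangerous pathogens and check whether the model picks the correct answer. High accuracy suggests the model holds knowledge that could be misused.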
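WMDP items are four-option multiple-choice questions, so a model's score is plain accuracy. An unlearning method aims to push that accuracy down toward the 25% random-guessing floor while general capabilities stay intact. The sketch below shows only that scoring logic, under stated assumptions: `score_choice` is a hypothetical hook standing in for however an evaluation harness ranks options (for instance, the model's log-likelihood of each answer), and the questions are content-free placeholders. It is not the official WMDP evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical scoring hook (an assumption, not a real API): returns a higher
# number when the model finds `choice` a more likely answer to `question`.
ScoreFn = Callable[[str, str], float]


@dataclass(frozen=True)
class MCQuestion:
    question: str
    choices: Sequence[str]  # four options per item
    answer: int             # index of the correct option


def accuracy(questions: Sequence[MCQuestion], score_choice: ScoreFn) -> float:
    """Fraction of questions where the top-scored option is the correct one."""
    correct = 0
    for q in questions:
        scores = [score_choice(q.question, c) for c in q.choices]
        # Pick the highest-scoring option; ties go to the earliest index.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == q.answer
    return correct / len(questions)


def chance_level(n_choices: int = 4) -> float:
    """Accuracy expected from uniform random guessing."""
    return 1.0 / n_choices


# Placeholder items only; no real benchmark content.
questions = [
    MCQuestion(f"placeholder question {i}", ["A", "B", "C", "D"], answer=i % 4)
    for i in range(8)
]

# Toy "model" that knows every answer.
answer_key = {(q.question, q.choices[q.answer]) for q in questions}


def knows_everything(question: str, choice: str) -> float:
    return 1.0 if (question, choice) in answer_key else 0.0


# Toy "model" with no relevant knowledge: scores every option the same,
# so it always picks the first option.
def knows_nothing(question: str, choice: str) -> float:
    return 0.0


print(accuracy(questions, knows_everything))  # 1.0
print(accuracy(questions, knows_nothing))     # 0.25
print(chance_level())                         # 0.25
```

Comparing a model's accuracy before and after unlearning against `chance_level()` is the kind of check this glossary entry describes: a well-unlearned model should land near 0.25 on the hazardous questions.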

Heard on the show

“One of them is biosecurity — specifically the WMDP benchmark, which probes hazardous biology knowledge.”

Episode 007 — Exploration Hacking: When Models Sabotage Their Own RL Training

Mentioned in 1 episode

007
Exploration Hacking: When Models Sabotage Their Own RL Training

Related concepts

WMDP Benchmark

Related terms

linear probe