Glossary · Term

chain-of-thought monitoring

Definition

Plain language

Reading what an AI 'thinks out loud' to catch it before it does something bad.

As stated in the literature

A safety practice in which a separate model or human reads a reasoning model's CoT trace before it acts, flagging plans for deception, sandbagging, or policy violations.

Also called: CoT monitoring

Why it matters: It's one of the few oversight tools that scales with capability, but only as long as models keep doing their reasoning in legible text.

For example, a watcher model reads a coding agent's scratchpad and pauses it when the plan mentions disabling logging before exfiltrating files.

Heard on the show

“The second is chain-of-thought monitoring, and the paper's evidence cuts both ways on it.”

Episode 128 — How a Model Can Earn Full Reward and Still Resist Training

Mentioned in 3 episodes

128
How a Model Can Earn Full Reward and Still Resist Training
094
Chain-of-Thought Monitoring Fails Across Languages, and Worst Where It's Needed Most
054
When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window

Related terms

chain of thought policy reasoning model sandbagging