Glossary · Term

controllability

Definition

Plain language

How reliably a model follows instructions to write in a particular style, voice, or format.

As stated in the literature

A property of instruction-tuned LLMs measured by compliance with content-neutral formatting directives; shown to correlate strongly with monitor-evasion ability under synthetic-document training.

Why it matters: It's both a useful product feature and a worrying capability proxy, since the same skill that lets a model follow formatting rules can also let it follow instructions to hide its reasoning from monitors.

For example, told to 'reply in JSON with exactly three fields,' a highly controllable model does so reliably while a less controllable one drifts into prose.
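One way to make "compliance with a formatting directive" concrete is to score a batch of replies against the JSON example above. The following is a minimal sketch only: the function names `follows_directive` and `compliance_rate` and the sample replies are hypothetical, invented for illustration, and this is not the metric used in the literature or in the paper discussed on the show.

```python
import json


def follows_directive(reply: str, n_fields: int = 3) -> bool:
    """Return True if the reply is a JSON object with exactly n_fields top-level keys."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        # Prose, or JSON wrapped in prose, does not parse and counts as non-compliant.
        return False
    return isinstance(parsed, dict) and len(parsed) == n_fields


def compliance_rate(replies: list[str], n_fields: int = 3) -> float:
    """Fraction of replies that satisfy the directive (0.0 for an empty batch)."""
    if not replies:
        return 0.0
    return sum(follows_directive(r, n_fields) for r in replies) / len(replies)


# Hypothetical replies to "reply in JSON with exactly three fields".
sample_replies = [
    '{"title": "a", "summary": "b", "score": 1}',            # compliant
    'Sure! Here is the JSON you asked for: {"title": "a"}',  # drifts into prose
    '{"title": "a", "summary": "b"}',                        # valid JSON, wrong field count
    '{"title": "a", "summary": "b", "score": 2}',            # compliant
]

print(compliance_rate(sample_replies))
```

Under these assumptions, a highly controllable model would score close to 1.0 on such a check, and a less controllable one would score lower as its replies drift into prose or miss the field count.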

Heard on the show

“Yeah, this is the controllability predictor, and I want to take some time with it because I think it's the most generative idea in the paper.”

Episode 054 — When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window

Mentioned in 1 episode

054
When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window

Related terms

instruction tuning