Glossary · Term

controllability

Definition

How easily a model can be instructed to write in a specific style, voice, or format.

A property of instruction-tuned LLMs, measured by how reliably they comply with content-neutral formatting directives. It has been shown to correlate strongly with monitor-evasion ability under synthetic-document training.
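
As a rough illustration of "compliance with content-neutral formatting directives", the sketch below scores a batch of outputs by the fraction that obey their directive. The checker functions, the sample outputs, and the name `compliance_rate` are all hypothetical, chosen for illustration; the definition above does not specify how the measurement is actually carried out.

```python
from typing import Callable, List, Tuple

# Hypothetical checkers: each returns True when an output obeys one
# content-neutral formatting directive. They are illustrative only.
def is_all_lowercase(text: str) -> bool:
    return text == text.lower()

def is_bulleted(text: str) -> bool:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(ln.lstrip().startswith("- ") for ln in lines)

def compliance_rate(samples: List[Tuple[Callable[[str], bool], str]]) -> float:
    """Fraction of outputs that satisfy their paired formatting directive."""
    if not samples:
        return 0.0
    passed = sum(1 for check, output in samples if check(output))
    return passed / len(samples)

# Made-up (directive checker, model output) pairs.
samples = [
    (is_all_lowercase, "the answer is four."),
    (is_all_lowercase, "The answer is four."),
    (is_bulleted, "- first\n- second"),
    (is_bulleted, "first, then second"),
]
score = compliance_rate(samples)
```

The checks look only at form (casing, bullet markers), never at what the text says, which is the sense in which the directives are content-neutral.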

Mentioned in 1 episode

  1. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window