Definition
A proposed benchmark that grades AI assistants on the gap between what they say they'll do and what they actually do.
A two-channel evaluation routing model text output and tool-call logs to separate scorers, with the Compliance Gap between verbal and behavioral compliance as a first-class metric.