Glossary · Term

audit

Definition

A controlled experiment that probes a deployed AI system for capabilities or propensities of concern.

In safety evaluation, a structured behavioral probe of frontier models across scaffolding levels designed to estimate capability and propensity for a target failure mode like exploration hacking or peer preservation.

Mentioned in 17 episodes

082
Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
078
Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
075
Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
062
Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
061
When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
054
When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
049
An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
039
When Smarter Agents Get Fooled by Three Extra Nodes in a Database
029
Why Forty-Eight Percent on FrontierMath Isn't the Real Story in DeepMind's New Math Paper
020
The Compliance Gap: Why AI Says Yes and Does No
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
017
When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
015
The Audit Number Isn't What You Think: Sycophancy and the Case Against Single-Prompt Bias Tests
008
Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
007
Exploration Hacking: When Models Sabotage Their Own RL Training
001
When AI Models Quietly Protect Each Other From Shutdown

Related concepts

AI & Security Creation-Audit Loop Deliberative Alignment Structured Trace Formatting