Glossary · Term

audit

← all terms

Definition

A controlled experiment that probes a deployed AI system for capabilities or propensities of concern.

In safety evaluation, a structured behavioral probe of frontier models across scaffolding levels designed to estimate capability and propensity for a target failure mode like exploration hacking or peer preservation.

Mentioned in 17 episodes

  1. 082
    Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
  2. 078
    Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
  3. 075
    Growing Code and Proof Together: Verified Systems in Ten Hours Instead of a Year
  4. 062
    Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
  5. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
  6. 058
    Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
  7. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
  8. 049
    An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
  9. 039
    When Smarter Agents Get Fooled by Three Extra Nodes in a Database
  10. 029
    Why Forty-Eight Percent on FrontierMath Isn't the Real Story in DeepMind's New Math Paper
  11. 020
    The Compliance Gap: Why AI Says Yes and Does No
  12. 019
    When the Best Reward Model Trains the Worst Policy: Inside EvoLM
  13. 017
    When the Agent Grades Its Own Homework: A Brutal New Benchmark for AI Workers
  14. 015
    The Audit Number Isn't What You Think: Sycophancy and the Case Against Single-Prompt Bias Tests
  15. 008
    Why Long-Horizon AI Agents Get Stuck, and a Milestone-Based Fix That Helps
  16. 007
    Exploration Hacking: When Models Sabotage Their Own RL Training
  17. 001
    When AI Models Quietly Protect Each Other From Shutdown

Related concepts