Boundless DAS

Definition

An interpretability tool that finds a single internal direction inside a model that lines up with a behavior.

A variant of distributed alignment search (DAS) that learns a low-rank rotation of a model's activations so that a target one-dimensional subspace of the rotated space aligns with the behavior of interest. Judge Circuits uses it to identify the Latent Evaluator's judgment axis.
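To make the "one-dimensional subspace" idea concrete, here is a minimal sketch of the intervention step that alignment-search methods rely on: swapping one activation's component along a single direction for another's. It is an illustration under assumptions, not the Judge Circuits implementation. The function name `swap_along_direction` and the random vectors are hypothetical, and the direction is drawn at random here, whereas Boundless DAS would learn it by gradient descent.

```python
import numpy as np


def swap_along_direction(base, source, direction):
    """Return `base` with its component along `direction` replaced by `source`'s.

    Everything orthogonal to `direction` is left untouched.
    """
    v = direction / np.linalg.norm(direction)  # unit vector spanning the 1-D subspace
    return base + (source @ v - base @ v) * v


rng = np.random.default_rng(0)
base = rng.normal(size=8)       # hypothetical activation from the prompt being patched
source = rng.normal(size=8)     # hypothetical activation from a counterfactual prompt
direction = rng.normal(size=8)  # assumption: stands in for the learned direction

patched = swap_along_direction(base, source, direction)
v = direction / np.linalg.norm(direction)
```

If patching along the direction reliably changes the model's output the way the counterfactual prompt would, that is evidence the direction carries the behavior in question.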

Mentioned in 1 episode

  1. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format