Glossary · Term

Boundless DAS

Definition

Plain language

An interpretability tool that finds a single internal direction inside a model that lines up with a behavior.

As stated in the literature

A learned low-rank rotation method for distributed alignment search that aligns model activations with a target one-dimensional subspace, used in Judge Circuits to identify the Latent Evaluator's judgment axis.

Why it matters: It moves interpretability from 'this layer matters' to 'this exact direction encodes the judgment', enabling much sharper causal claims and targeted interventions.

For example, Boundless DAS might find a single direction in a model's residual stream such that overwriting the activation's component along that direction (an interchange intervention) switches the model's judgment from 'safe' to 'unsafe'.
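The example above can be made concrete with a small sketch. This is only an illustration of the intervention step for a one-dimensional subspace, not the method's actual implementation: the learning of the rotation is omitted and the direction is supplied by hand. The names `interchange_1d`, `verdict`, and `READOUT`, the toy 3-dimensional activations, and the linear "judge" are all hypothetical stand-ins for a real model.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def interchange_1d(base, source, direction):
    """Replace base's component along `direction` with source's component.

    For a one-dimensional subspace this is equivalent to rotating both
    activations so `direction` becomes the first axis, swapping that one
    coordinate, and rotating back.
    """
    norm = math.sqrt(dot(direction, direction))
    v = [x / norm for x in direction]          # unit vector
    delta = dot(source, v) - dot(base, v)      # change needed along v
    return [b + delta * x for b, x in zip(base, v)]


# Hypothetical toy judge: a fixed linear readout over a 3-d activation.
READOUT = [0.6, 0.8, 0.0]


def verdict(h):
    return "unsafe" if dot(h, READOUT) > 0 else "safe"


base = [-1.0, -0.5, 2.0]     # activation from an input judged 'safe'
source = [0.5, 1.0, -3.0]    # activation from an input judged 'unsafe'

# Intervening along the judgment direction carries the verdict over.
patched = interchange_1d(base, source, READOUT)

# Intervening along an unrelated direction leaves the verdict alone.
control = interchange_1d(base, source, [0.0, 0.0, 1.0])

print(verdict(base), verdict(source), verdict(patched), verdict(control))
```

In this toy setup the verdict changes only when the swap happens along the direction the readout depends on. In practice the direction is not known in advance; finding it is the part the learned rotation is for.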

Heard on the show

“The probe they use to find this axis — it's a method called Boundless DAS, and the way it works is it trains a low-rank rotation to align activations with a single direction.”

Episode 055 — Why LLM Judges Flip Their Verdicts When You Change the Question Format

Mentioned in 1 episode

055
Why LLM Judges Flip Their Verdicts When You Change the Question Format

Related terms

alignment training, Judge Circuits, Latent Evaluator