Concept · 10 episodes

Probing

Definition

Probing is a family of techniques that train a small classifier (a probe) on a model’s internal states to test whether some property is encoded there. Probes are useful diagnostics but slippery as evidence: a probe can decode a property with high accuracy even when the model itself never uses that information.
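To make the definition concrete, here is a minimal sketch of a linear probe. Everything in it is an assumption for illustration: the "activations" are synthetic Gaussian vectors with a binary property planted along one direction, not hidden states from any real model, and the helper names `train_linear_probe` and `probe_accuracy` are hypothetical. It also runs the same probe on shuffled labels as a simple control, which is one common sanity check rather than a complete methodology.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # gradient step with L2 penalty
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples where the probe's sign matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
n, d = 2000, 32

# Synthetic stand-in for hidden states (assumption): Gaussian noise plus
# a shift of +/-1.5 along one fixed unit direction, depending on the label.
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * y - 1, direction)

# Hold out the last 500 examples so accuracy is not measured on training data.
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

w, b = train_linear_probe(X_train, y_train)
probe_acc = probe_accuracy(X_test, y_test, w, b)

# Control: same activations, labels shuffled so they carry no real signal.
y_shuffled = rng.permutation(y)
w_c, b_c = train_linear_probe(X_train, y_shuffled[:1500])
control_acc = probe_accuracy(X_test, y_shuffled[1500:], w_c, b_c)

print(f"probe accuracy:   {probe_acc:.3f}")
print(f"control accuracy: {control_acc:.3f}")
```

A large gap between the two numbers suggests the property is linearly decodable from these vectors. It does not show that a model relies on that direction; establishing that would need a causal test, such as intervening on the activations and checking whether behavior changes.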

Episodes covering this

185
Aligned to Refuse, Built to Tap: When Phone Agents Know the Task Is a Crime and Do It Anyway
It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents
Sun, Chen, Zhou et al. · Fudan University·27 min·Jun 30, 2026
171
The Safety Decision a Model Makes Before It Thinks a Word
Do Thinking Tokens Help with Safety?
Ri, Panigrahi, Arora · Princeton Language and Intelligence·25 min·Jun 25, 2026
158
How Floating-Point Rounding Lets a Model Tell Which Chip It's On — And Misbehave
FloatDoor: Platform-Triggered Backdoors in LLMs
Loose, Sander, Mächtle et al. · University of Luebeck·29 min·Jun 19, 2026
153
Catching a Lie From the Inside, When the Words Look Completely Honest
Rift: A Conflict Signature for Deception in Language Models
Nyoma · Harmonic Labs·26 min·Jun 18, 2026
141
How Two Tokens Reopened a Reasoning Method the Field Had Given Up On
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Yang, Chen, Wu et al. · HKUST(GZ)·29 min·Jun 12, 2026
140
When a Reasoning Model Says "Let Me Double-Check" After It's Already Decided
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
Scalena, Candussio, Bortolussi et al. · University of Groningen / University of Milano-Bicocca·27 min·Jun 12, 2026
098
Finding Millions of Readable Concepts Inside a Real, Deployed AI Model
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton, Conerly, Marcus et al. · Anthropic·28 min·May 29, 2026
037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Elbadry, Heakl, Zhang et al. · Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)·27 min·May 12, 2026
032
A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
Aviss · Fifth Dimension·23 min·May 09, 2026
004
The Sycophancy Circuit That Survives Alignment Training
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Pandey · Georgia Institute of Technology·29 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.