linear probe

Definition

A small, simple classifier trained on a model's internal states to test what information they contain.

A linear classifier trained on frozen intermediate activations to detect whether a particular concept is linearly decodable from the representation.
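As a rough illustration of the definition above, here is a minimal sketch of a linear probe in plain Python. Everything in it is an assumption made for the example: the "activations" are synthetic vectors standing in for frozen hidden states, the concept is planted along a fixed direction, and the helper names (`make_activations`, `train_probe`, `accuracy`) are hypothetical. In practice the vectors would be extracted from a chosen layer of a real model, and a library classifier would usually replace the hand-written logistic regression.

```python
import math
import random


def make_activations(n, dim, rng):
    """Synthetic stand-in for frozen activations (hypothetical data, not a real model).

    The binary concept label shifts each vector by +1 or -1 in every dimension,
    so the concept is linearly decodable by construction.
    """
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + shift for _ in range(dim)])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs, ys, epochs=200, lr=0.1):
    """Fit logistic regression with batch gradient descent.

    Only the probe's weights and bias are updated; the activations are
    treated as fixed inputs, which is what "frozen" means here.
    """
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples where the sign of the linear score matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((score > 0) == (y == 1))
    return hits / len(xs)


rng = random.Random(0)
xs, ys = make_activations(400, 8, rng)

# Train on 300 examples, evaluate on the 100 held out.
w, b = train_probe(xs[:300], ys[:300])
test_acc = accuracy(w, b, xs[300:], ys[300:])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

High held-out accuracy suggests the concept is linearly decodable from the representation. It does not by itself show that the model uses that information.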

Also called: linear probing, linear classifier, linear probes, probe, probes

Mentioned in 23 episodes

  1. 077
    Reading a Model's Confidence Curve to Decide When Chain-of-Thought Is Worth It
  2. 073
    When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving
  3. 070
    When Models Know the Answer But Say the Wrong Thing Anyway
  4. 069
    When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
  5. 061
    When Helpful Agents Go Sideways: A 404 Error, Campus Security, and Why Alignment Misses This
  6. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
  7. 054
    When Models Learn the Monitor Exists, the Reasoning Trace Stops Being a Window
  8. 051
    Why Parallel Sampling Plateaus, And What Evidence Graphs Do Instead
  9. 043
    When 'This Is False' Doesn't Stick: Why Models Learn the Lie Anyway
  10. 037
    Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say
  11. 032
    A Sticky-Note for Every Layer: Letting Transformers Remember What They Were Just Thinking
  12. 031
    When Your AI Assistant Won't Let Go of Old Facts About You
  13. 030
    Why Your AI Agent Won't Stop Working — and Each Model Falls for a Different Trap
  14. 022
    Training the Model Spec Directly: An Alignment Lever Aimed at the Say-Do Gap
  15. 018
    Language Models Compute the Rational Move, Then Override It
  16. 015
    The Audit Number Isn't What You Think: Sycophancy and the Case Against Single-Prompt Bias Tests
  17. 012
    Why AI Coding Agents Keep Trying to Debug Without a Debugger
  18. 011
    When RL Actually Teaches Agents Something New, And When It Doesn't
  19. 007
    Exploration Hacking: When Models Sabotage Their Own RL Training
  20. 006
    What Happens Inside Claude When It Decides to Blackmail Someone
  21. 005
    Why a Debugger Designed for Humans Is the Wrong Tool for an AI Agent
  22. 004
    The Sycophancy Circuit That Survives Alignment Training
  23. 002
    An AI Ran a Real Optics Lab for 21 Hours and Found a Transformer-Shaped Pattern in Light

Related concepts