
SAPLMA

Definition

Plain language

A probing method that reads a language model's hidden states to predict whether its current statement is true.

As stated in the literature

Statement-Accuracy Prediction based on Language Model Activations: a supervised probe, trained on a model's hidden states, that classifies whether the model's assertions are true or false.

Why it matters: It suggests that models often "know" a statement is false, in the sense that their internal activations encode this even while they assert it. That opens a path to detecting hallucinations without checking each claim against external sources.

For example, when the model outputs "Paris is in Germany," the probe reads the model's hidden state at the end of that statement and predicts that it is false, even though the model produced it.
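A minimal sketch of the idea, assuming synthetic activations in place of a real model's. Each statement is a list of per-token hidden-state vectors from one layer, the probe reads the vector at the statement's final token, and a classifier trained on labeled true/false statements outputs the probability that a statement is true. SAPLMA as published trains a small feedforward network on one layer's activations; this sketch swaps in a single logistic-regression (linear) probe for brevity. `make_statement`, `DIM`, and the truth signal planted on dimension 0 are hypothetical stand-ins, not part of any real model or library.

```python
import math
import random

DIM = 8  # hypothetical hidden-state width; real models use thousands


def make_statement(is_true, rng, n_tokens=5):
    """Hypothetical stand-in for running a model on a statement.

    Returns (per-token hidden states, label). A 'truth direction' is
    planted in the final token only (assumed: along dimension 0), so the
    probe has a signal to find.
    """
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n_tokens)]
    tokens[-1][0] += 2.0 if is_true else -2.0
    return tokens, 1 if is_true else 0


def final_token_features(token_states):
    """Read the hidden state at the statement's last token."""
    return token_states[-1]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_true_prob(w, b, x):
    """Probe output: estimated probability that the statement is true."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_true_prob(w, b, x) - y  # gradient of log-loss wrt logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)
train = [make_statement(i % 2 == 0, rng) for i in range(200)]
X = [final_token_features(tokens) for tokens, _ in train]
y = [label for _, label in train]
w, b = train_probe(X, y)

# Evaluate on held-out synthetic statements the probe has not seen.
test = [make_statement(i % 2 == 0, rng) for i in range(100)]
correct = sum(
    (predict_true_prob(w, b, final_token_features(tokens)) >= 0.5) == bool(label)
    for tokens, label in test
)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real setting, `make_statement` would be replaced by running the model on labeled statements such as "Paris is in Germany" and collecting one layer's activations; which layer works best is an empirical choice.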

Heard on the show

“Meanwhile every existing detector they benchmark — token entropy, semantic entropy, CCS, SAPLMA — clusters right around fifty.”

Episode 037 — Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say

Mentioned in 1 episode

037
Why Hallucination Detectors Miss Stale Facts: A Geometric Story About What Models Know But Don't Say

Related terms

hidden state · linear probe