Glossary · Term

Format Transfer Injection

Definition

Plain language

A test where researchers copy a model's mid-computation state from one prompt into another to see if it changes the answer.

As stated in the literature

An interpretability intervention that transplants intermediate activations from one prompt format (e.g., rating) into another (e.g., classification) to test whether judgment is portable across output formats.

Also called: FTI

Why it matters: It's a way to test whether a model's underlying judgment is the same across output formats or whether the format is silently changing what it 'thinks.'

For example, you capture the internal state a model uses when scoring an essay 1-to-10 and paste it into a separate run that's asked to label the essay 'good' or 'bad,' then check if the label matches.

Heard on the show

“They call it Format Transfer Injection — let's just call it the transplant.”

Episode 055 — Why LLM Judges Flip Their Verdicts When You Change the Question Format

Mentioned in 1 episode

055
Why LLM Judges Flip Their Verdicts When You Change the Question Format