Concept · 1 episode

Reviewer-Pleasing Bias


Definition

Reviewer-pleasing bias is the tendency of trained models (or papers) to optimize for whatever scores well with the reviewer rather than for what is actually useful. In LLM training it is a close cousin of sycophancy and a structural risk in any RLHF setup, because the policy is rewarded for matching raters' preferences, which are only a proxy for usefulness.
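
A toy illustration may make the gap concrete. The sketch below is not drawn from any real training pipeline: the `Candidate` fields, the `flattery_weight` value, and the two example answers are all hypothetical, chosen only to show how selecting by reviewer score can diverge from selecting by usefulness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    usefulness: float  # hypothetical "true" value of the answer
    flattery: float    # hypothetical measure of how pleasing it is to the reviewer


def reviewer_score(c: Candidate, flattery_weight: float = 2.0) -> float:
    # Assumed reviewer model: rewards usefulness, but over-weights flattery.
    return c.usefulness + flattery_weight * c.flattery


def pick_by_reviewer(cands: list[Candidate]) -> Candidate:
    # What an optimizer does when it is rewarded only by the reviewer's score.
    return max(cands, key=reviewer_score)


def pick_by_usefulness(cands: list[Candidate]) -> Candidate:
    # What we would pick if usefulness could be measured directly.
    return max(cands, key=lambda c: c.usefulness)


# Made-up candidates for illustration only.
CANDIDATES = [
    Candidate("blunt but correct", usefulness=0.9, flattery=0.1),
    Candidate("agreeable but vague", usefulness=0.4, flattery=0.8),
]

if __name__ == "__main__":
    print("reviewer picks:  ", pick_by_reviewer(CANDIDATES).name)
    print("usefulness picks:", pick_by_usefulness(CANDIDATES).name)
```

Under these assumed numbers the reviewer-optimized choice is the agreeable answer, while the usefulness-optimized choice is the blunt one. The bias is the gap between the two selections.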

Episodes covering this