Definition
LLM-as-judge uses one language model to score another’s outputs, replacing slow and expensive human evaluation for many tasks. It’s indispensable at scale but has well-known biases: judges tend to prefer longer answers, outputs from their own model family, and reasoning that sounds confident.
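To make the definition concrete, here is a minimal sketch of a judge loop in Python. Everything in it is an assumption made for illustration: the prompt wording, the 1-to-5 scale, the helper names `parse_score` and `judge`, and especially `toy_length_biased_model`, which stands in for a real model call and is deliberately biased toward longer answers so the length preference described above is visible.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real judge prompts vary widely.
JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Answer: {answer}
Rate the answer's correctness from 1 (poor) to 5 (excellent).
Judge the content only; do not reward length or a confident tone.
Reply with the line "Score: <n>"."""


def parse_score(reply: str) -> Optional[int]:
    """Pull a 1-5 score out of the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model to grade one answer and return the parsed score."""
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return parse_score(reply)


def toy_length_biased_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: not a real API).

    It ignores correctness and scores purely by word count, mimicking
    the length bias a real judge can show.
    """
    answer = prompt.split("Answer: ", 1)[1].split("\nRate", 1)[0]
    score = min(5, 1 + len(answer.split()) // 5)
    return f"Score: {score}"


if __name__ == "__main__":
    question = "What is 2 + 2?"
    short = "4"
    padded = ("The answer, after careful and thorough consideration "
              "of all the relevant arithmetic, is most certainly 4")
    # Both answers are equally correct, yet the biased judge ranks them apart.
    print(judge(question, short, toy_length_biased_model))
    print(judge(question, padded, toy_length_biased_model))
```

Swapping `toy_length_biased_model` for a function that calls an actual model turns this into a usable harness. Scoring pairs of equally correct answers that differ only in length, as the example does, is one simple way to check a judge for length bias.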
Episodes covering this
Worth reading next
Papers we haven't done a deep dive on yet, but would recommend on this topic.