Concept · 12 episodes

LLM-as-Judge


Definition

LLM-as-judge uses one language model to score another’s outputs, replacing slow and expensive human evaluation for many tasks. It’s indispensable at scale but has well-known biases: judges tend to prefer longer answers, outputs from their own model family, and reasoning that sounds confident.
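One way to make the length bias concrete is to measure how often a judge sides with the longer of two answers. The sketch below is illustrative only: the `Judge` signature, `longer_answer_win_rate`, and `toy_length_biased_judge` are hypothetical names, and the toy judge is a stand-in for a real model call, not any particular library's API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: takes (prompt, answer_a, answer_b)
# and returns the label of the preferred answer, "A" or "B".
Judge = Callable[[str, str, str], str]


def longer_answer_win_rate(
    judge: Judge, items: Sequence[Tuple[str, str, str]]
) -> float:
    """Fraction of comparisons in which the judge picks the longer answer.

    Pairs whose answers have equal length are skipped. Returns NaN if
    no pair could be counted.
    """
    wins = 0
    counted = 0
    for prompt, answer_a, answer_b in items:
        if len(answer_a) == len(answer_b):
            continue
        longer = "A" if len(answer_a) > len(answer_b) else "B"
        verdict = judge(prompt, answer_a, answer_b)
        counted += 1
        if verdict == longer:
            wins += 1
    return wins / counted if counted else float("nan")


def toy_length_biased_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    # Assumption: a stand-in for a real LLM call. It ignores quality
    # entirely and always prefers the longer answer.
    return "A" if len(answer_a) >= len(answer_b) else "B"


if __name__ == "__main__":
    pairs = [
        ("What is 2 + 2?", "4", "The answer, after careful thought, is 4."),
        ("Capital of France?", "Paris is the capital of France.", "Paris"),
    ]
    print(longer_answer_win_rate(toy_length_biased_judge, pairs))
```

A rate well above 0.5 on pairs that human raters consider equally good would suggest a length preference. On its own the number is not proof of bias, since longer answers can be genuinely better, so it is most informative on quality-matched pairs.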

Episodes covering this

  1. 079
    An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
    Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University·29 min·May 25, 2026
  2. 062
    Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
    Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
  3. 059
    Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
    Lu, Wang, Lu et al. · Northeastern University·22 min·May 20, 2026
  4. 058
    Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
    Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
  5. 057
    How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
    Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
  6. 055
    Why LLM Judges Flip Their Verdicts When You Change the Question Format
    Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
  7. 045
    When a Frontier Model Talks Its Own Twin Into Climate Denial
    Nogueira, Almeida, Bonás et al. · Maritaca AI·31 min·May 15, 2026
  8. 031
    When Your AI Assistant Won't Let Go of Old Facts About You
    Chao, Bai, Sheng et al. · Wuhan University·24 min·May 09, 2026
  9. 028
    Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization
    Gandhi, Chakraborty, Wang et al. · Carnegie Mellon University·23 min·May 08, 2026
  10. 020
    The Compliance Gap: Why AI Says Yes and Does No
    Shin · Polymath Minds AI Lab·28 min·May 06, 2026
  11. 019
    When the Best Reward Model Trains the Worst Policy: Inside EvoLM
    Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026
  12. 003
    How to Pick the Best of Sixteen Coding Agent Rollouts
    Kim, Yang, Niu et al. · Meta Superintelligence Labs / University of Washington·17 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.