Concept · 49 episodes

LLM-as-Judge

Definition

LLM-as-judge uses one language model to score another’s outputs, standing in for slow, expensive human evaluation on many tasks. It is indispensable at scale but has well-known biases: judges tend to favor longer answers, outputs from their own model family, and reasoning that sounds confident.
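
One way to make the length bias concrete is a padding probe: score an answer, append content-free filler, and score it again. The sketch below is illustrative only. `toy_judge` is a hypothetical stand-in for a real model call and is deliberately length-biased so the effect is visible; `length_bias_gap` and `FILLER` are likewise names invented for this example, not part of any library.

```python
from typing import Callable

# A judge maps (question, answer) to a score in [0, 1].
Judge = Callable[[str, str], float]


def toy_judge(question: str, answer: str) -> float:
    """Hypothetical stand-in for a model call: rewards word count only."""
    return min(1.0, len(answer.split()) / 50)


def length_bias_gap(judge: Judge, question: str, answer: str,
                    filler: str, repeats: int = 5) -> float:
    """Return the score change when content-free filler is appended."""
    padded = answer + " " + " ".join([filler] * repeats)
    return judge(question, padded) - judge(question, answer)


# Filler that adds words but no new information (an assumption of this sketch).
FILLER = "To be thorough, it is worth restating the point above."

gap = length_bias_gap(toy_judge, "What is 2 + 2?", "The answer is 4.", FILLER)
print(f"score gap from padding: {gap:.2f}")
```

A gap well above zero on answers whose correctness did not change suggests the judge is rewarding length rather than content. With a real judge, the same probe would be averaged over many question and answer pairs rather than read off a single example.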

Episodes covering this

207
An AI Graded Its Own Math Test 94 Percent — It Actually Scored 20
More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges
12 min·Jul 08, 2026
205
The Same AI, Two Labels: How the Pitch Beat the Product in 162 Sessions
Rating the Pitch, Not the Product: User Evaluations of LLMs Reflect Expectations More Than Performance
13 min·Jul 07, 2026
202
How Do You Know an AI Agent Actually Refused? Check the World, Not the Words
Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
Feng, Lin, Wen et al. · AntGroup / Hunan Institute of Advanced Technology·18 min·Jul 06, 2026
201
One in Four NeurIPS Papers Cites a Reference That Doesn't Exist
Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences
Russinovich, Kumar, Salem · Microsoft·19 min·Jul 06, 2026
197
Twin Problems Suggest AI Reasoning Gains Are Mostly Better Fact Recall
IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Abdaljalil, Serpedin, Kurban · Texas A&M University·17 min·Jul 03, 2026
196
AI Agents Reached Opposite Conclusions From the Same Data — and Passed Review
The Agentic Garden of Forking Paths
Miao, Pritchard, Zou · Stanford University·18 min·Jul 03, 2026
195
Why 'Be Careful' Does Nothing for AI Coding Agents, and What Does
Coding Agents Are Guessing: Measuring Action-Boundary Violations in Underspecified DevOps Instructions
Ji, Zhang, Xu et al. · Hong Kong University of Science and Technology·15 min·Jul 03, 2026
192
A 32B Open Model Matched Frontier Systems By Learning to Take Notes
AutoMem: Automated Learning of Memory as a Cognitive Skill
Wu, Zhu, Zhang et al. · Stanford University·22 min·Jul 02, 2026
191
How One Researcher Beat GPT-5.2 and Gemini 3 by Judging Their Answers, Not Improving Them
Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2
Land · Independent Researcher·26 min·Jul 02, 2026
190
The Skill Every AI Manager Is Missing: Handing Out Exactly the Right Keys
ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents
Xiong, Ji, Qiu et al. · UNC Chapel Hill·21 min·Jul 02, 2026
189
Why Phone Agents Ace the Test and Crash on Your Actual Phone
Xiaomi-GUI-0 Technical Report
Team, Qu, Luan · Xiaomi·24 min·Jul 02, 2026
188
A Coding Agent Found a Hole in a Peer-Reviewed STOC Proof for Five Dollars
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Moakhar, Gholami, Springer et al. · University of Maryland·20 min·Jul 02, 2026
184
An AI Built an Undetectable Secret Channel, And Another AI Couldn't Find It
Tool Use Enables Undetectable Steganography in Multi-Agent LLM Systems
Rippin, Marshall, Africa et al. · Oxford University·19 min·Jun 30, 2026
178
How an AI Reviewer Learned to Stop Going Easy on AI Writing
The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
Iacob, Jovanović, Shen et al. · University of Cambridge·23 min·Jun 26, 2026
176
An AI Designed Its Own Psychology Studies, Then Confirmed What It Found
Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist
Jagadish, Strittmatter, Jacoby et al. · Princeton University·31 min·Jun 26, 2026
173
The Free Step-Level Grader Hiding in Every RL Training Run
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Oh, Li, Park et al. · University of Wisconsin–Madison·22 min·Jun 25, 2026
169
Why Better Bug Reports Can Make AI Coding Agents Worse
SHERLOC: Structured Diagnostic Localization for Code Repair Agents
Tamoyan, Narenthiran, Arakelyan et al. · NVIDIA / TU Darmstadt·24 min·Jun 24, 2026
167
How Teaching an AI to Predict, Not Act, Made It a Better Actor
Qwen-AgentWorld: Language World Models for General Agents
Team, Zuo, Xiao et al.·27 min·Jun 24, 2026
155
Why a Flawless Demo Makes a Worse Computer-Using Agent, And the Fix
Skill-Guided Continuation Distillation for GUI Agents
Fan, Yu, Shen et al. · StepFun·22 min·Jun 18, 2026
152
Training a Model to Mean What It Says, And Why That Isn't the Same as Being Good
Self-CTRL: Self-Consistency Training with Reinforcement Learning
Pres, Ruis, Ghebreselassie et al. · MIT CSAIL·26 min·Jun 18, 2026
147
Agents Fail at the Body, Not the Brain: A Self-Rewriting Scaffold That Lifts a 9B Model 44 Points
HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry
Chen, Lu, Zhao et al.·30 min·Jun 15, 2026
146
How an Innocent README Can Freeze an AI Agent's Safety Check for an Hour
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails
Zhou, Wang, Ma et al. · Hong Kong University of Science and Technology·26 min·Jun 15, 2026
133
How MiniMax Turned a Reward-Hacking Disaster Into Olympiad Gold
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Chen, Zhang, Zhang et al. · MiniMax / The Chinese University of Hong Kong·34 min·Jun 12, 2026
132
The Agent Failed — But Did the Instructions Deserve to Be Followed?
SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement
Gautam, Radhakrishna, Gulwani · Microsoft·30 min·Jun 11, 2026
125
AI Coding Agents Run a Marathon, and Fewer Than One in Three Finish
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
Desai, Hu, Cabezas et al. · Abundant·27 min·Jun 09, 2026
124
A Cheap Model With the Blueprints Beats Expensive Models Working Blind
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Zhong, Segal, Bercovich et al. · Carnegie Mellon University·27 min·Jun 09, 2026
123
Five Identical Worlds, One Swapped Model: What Happens When AI Agents Run for Fifteen Days
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Akkil, Kokku, Vikram et al. · Emergence AI·30 min·Jun 09, 2026
122
When Your Coding Agent Lies About the Fix: Verifying the Plan Before the Model Runs
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Wang, Huang, Wang et al. · University of Illinois Urbana-Champaign·24 min·Jun 09, 2026
120
How an AI Agent Rewrites Its Own Tools, Without an Answer Key
Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Pan, Liu, Lin et al. · City University of Hong Kong·30 min·Jun 05, 2026
118
Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Hoang, Le, Xu et al. · Singapore University of Technology and Design·23 min·Jun 05, 2026
111
How a 4B Web Agent Beat Models 60x Its Size on 500 Demonstrations
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Yang, Wu, Chen et al. · UIUC·24 min·Jun 03, 2026
104
How Making a Research Agent Smarter Quietly Makes It Leak Your Secrets
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
Gurung, Gella, Drouin et al. · University of Edinburgh·25 min·Jun 01, 2026
103
AI Agents Tried to Invent a Post-Human Language, And Reinvented Cherokee
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Beltoft, Brach, Torrielli et al. · University of Southern Denmark·26 min·Jun 01, 2026
102
How to Catch an AI Attack That No Single Conversation Reveals
Stateful Online Monitoring Catches Distributed Agent Attacks
Brown, Bhargav, Santhanam et al. · University of Pennsylvania·24 min·Jun 01, 2026
089
When AI-Written Papers Read Well But the Evidence Underneath Is Broken
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Meng, Mishra, Chen et al. · Google Cloud AI Research·32 min·May 27, 2026
087
When No Agent Reads the Whole Document: A Universal Cliff in Multi-Agent Review
A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
Fukui · Research Institute of Criminal Psychiatry·26 min·May 27, 2026
082
Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick
QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
Xie, Lin, Wang et al. · The Ohio State University·31 min·May 26, 2026
079
An Old Idea From Cognitive Psychology Reshapes How We Reward Reasoning Models
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Chen, Xu, Zhao et al. · Tongji University / Shanghai AI Laboratory / Nanyang Technological University·29 min·May 25, 2026
062
Treating Hallucinations as Exploits: A Gate-Based Architecture for Agent Safety
Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Zhang, Zheng, Yang · Shenzhen University·24 min·May 20, 2026
059
Firefly's Inversion: Building Verified Tool-Call Training Data by Working Backward
Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs
Lu, Wang, Lu et al. · Northeastern University·22 min·May 20, 2026
058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe
The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
Liu, Holz, Ye et al. · University of Chinese Academy of Sciences·32 min·May 19, 2026
057
How Uber Caught 206 Leaked Credentials With an LLM-Powered Security Stack
ADR: An Agentic Detection System for Enterprise Agentic AI Security
Li, Hu, Xu et al. · Uber Technologies·28 min·May 19, 2026
055
Why LLM Judges Flip Their Verdicts When You Change the Question Format
Judge Circuits
Feldhus, Baeumel, Golimblevskaia et al. · Technische Universität Berlin / BIFOLD·26 min·May 19, 2026
045
When a Frontier Model Talks Its Own Twin Into Climate Denial
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
Nogueira, Almeida, Bonás et al. · Maritaca AI·31 min·May 15, 2026
031
When Your AI Assistant Won't Let Go of Old Facts About You
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Chao, Bai, Sheng et al. · Wuhan University·24 min·May 09, 2026
028
Teaching a Model to Hire Copies of Itself: Recursive Agent Optimization
Recursive Agent Optimization
Gandhi, Chakraborty, Wang et al. · Carnegie Mellon University·23 min·May 08, 2026
020
The Compliance Gap: Why AI Says Yes and Does No
The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't
Shin · Polymath Minds AI Lab·28 min·May 06, 2026
019
When the Best Reward Model Trains the Worst Policy: Inside EvoLM
EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Li, Xin, Xiao et al. · University of Washington·26 min·May 06, 2026
003
How to Pick the Best of Sixteen Coding Agent Rollouts
Scaling Test-Time Compute for Agentic Coding
Kim, Yang, Niu et al. · Meta Superintelligence Labs / University of Washington·17 min·May 01, 2026

Worth reading next

Papers we haven't done a deep dive on yet, but would recommend on this topic.