Definition
Using one AI model to grade what another AI model produced.
An evaluation or moderation pattern in which a language model serves as the grader of outputs, preferences, or safety properties.
Also called: LLM judge, LLM-as-judge, LLM-as-a-Judge