Definition
A score for how much two raters agree, beyond what you'd expect by chance.
An inter-rater agreement statistic correcting raw concordance for chance, commonly used to validate LLM judges against human annotators.
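
The entry does not name the statistic, so the sketch below assumes the common two-rater form usually called Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label frequencies. The function name `chance_corrected_agreement` and the example labels are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math


def chance_corrected_agreement(rater_a, rater_b):
    """Two-rater agreement corrected for chance (assumed: Cohen's kappa)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # p_o: fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: probability of agreeing by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both raters used one identical label throughout, the ratio is 0/0.
    if expected == 1.0:
        return math.nan

    return (observed - expected) / (1 - expected)


# Illustrative data only: a human annotator and an LLM judge on four items.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
kappa = chance_corrected_agreement(human, judge)
print(kappa)  # 0.5
```

Here the raters agree on 3 of 4 items (p_o = 0.75), but their label frequencies alone would produce agreement half the time (p_e = 0.5), so the chance-corrected score is 0.5 rather than 0.75.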