Definition
A benchmark of medical questions used to test how well models reason about health topics.
An evaluation suite of medical question-answering and clinical-reasoning tasks, used in EvoLM-style cross-domain transfer experiments to test whether learned evaluative rubrics generalize to expert health domains.