Definition
A benchmark that grades language models on how well they predict real-world events.
An LLM forecasting benchmark covering political, economic, and other event-prediction questions. It is scored primarily with Brier-style threshold metrics, which are blind to commitment failures in the upper tail of the probability range.
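
As a rough illustration of the scoring idea, the sketch below computes the standard binary Brier score alongside a simple thresholded accuracy. The definition does not give the benchmark's exact formula, so both functions, the 0.5 cutoff, and the sample forecasts are assumptions for illustration only. Under one reading of "upper-tail commitment failure" (a forecaster that says 0.55 where 0.95 was warranted), the thresholded metric cannot tell the two forecasters apart, while the plain Brier score can.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better; 0.0 is a perfect score. This is the textbook binary
    Brier score, not necessarily the benchmark's own variant.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def threshold_accuracy(
    probs: Sequence[float], outcomes: Sequence[int], threshold: float = 0.5
) -> float:
    """Fraction of questions where (p >= threshold) matches the outcome.

    Hypothetical stand-in for a "threshold metric": it only looks at which
    side of the cutoff a forecast falls on, not how far it commits.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    hits = sum((p >= threshold) == bool(o) for p, o in zip(probs, outcomes))
    return hits / len(probs)


# Illustrative data only: three resolved questions (1 = event happened).
outcomes = [1, 1, 0]
hedged = [0.55, 0.55, 0.45]      # barely commits to either side
committed = [0.95, 0.95, 0.05]   # commits strongly, and correctly

# Thresholded accuracy treats both forecasters as identical.
hedged_acc = threshold_accuracy(hedged, outcomes)
committed_acc = threshold_accuracy(committed, outcomes)

# The plain Brier score separates them.
hedged_brier = brier_score(hedged, outcomes)
committed_brier = brier_score(committed, outcomes)
```

Here both forecasters get the same thresholded accuracy, while the committed forecaster gets a much lower (better) Brier score. A metric built on thresholds therefore hides how strongly a model commits on questions it gets right.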