ForecastBench

Definition

Plain language

A benchmark that grades language models on how well they predict real-world events.

As stated in the literature

An LLM forecasting benchmark covering political, economic, and event-prediction questions, scored primarily with Brier-style metrics on binary threshold questions, a setup that is blind to upper-tail commitment failures.

Why it matters: It pushes language models to give calibrated probability estimates rather than confident yes/no answers, exposing where their reasoning breaks down.

For example, ForecastBench might ask a model to estimate the probability that a particular election outcome occurs by year-end, then score it against what actually happens.
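To make the scoring step concrete, here is a minimal sketch of the standard Brier score for binary questions: the mean squared gap between each forecast probability and the 0/1 outcome. It is an illustration only, not ForecastBench's actual scoring pipeline. The function name `brier_score` and the sample forecasts are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes.

    forecasts: probabilities in [0, 1], one per question.
    outcomes:  1 if the event happened, 0 if it did not.
    Lower is better; always answering 0.5 scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    if any(not 0.0 <= p <= 1.0 for p in forecasts):
        raise ValueError("forecasts must be probabilities in [0, 1]")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example: three yes/no questions and how they resolved.
model_forecasts = [0.9, 0.5, 0.1]
resolved_outcomes = [1, 1, 0]
score = brier_score(model_forecasts, resolved_outcomes)
print(f"Brier score: {score:.3f}")
```

A confident forecast that turns out right (0.9 on an event that happened) adds very little to the score, while the hedged 0.5 on the second question costs 0.25. Because the outcome is only ever 0 or 1, a score like this says nothing about how a model handles the size of an outcome, only whether a yes/no line was crossed.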

Heard on the show

“The big LLM forecasting benchmarks — ForecastBench, KalshiBench, others — are built around questions that are naturally binary.”

Episode 069 — When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Mentioned in 1 episode

069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Related terms

commitment failure