KalshiBench

Definition

Plain language

A benchmark that scores AI models on predicting outcomes from a real prediction market.

As stated in the literature

An LLM forecasting evaluation suite built on Kalshi prediction-market questions, reporting Brier-style metrics; cited as another current benchmark whose scoring rules cannot detect distributional commitment failures.

Why it matters: Questions drawn from a live prediction market give a public, ongoing test of LLM forecasting that cannot be gamed by memorizing a static benchmark.

For example, a model might assign a 70% probability to a particular candidate winning a primary election listed on Kalshi; once the market resolves, the model is scored on how close that probability was to the actual outcome.
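To make "Brier-style" concrete, here is a minimal sketch of the standard binary Brier score applied to the 70% example. It assumes the textbook definition (squared difference between the forecast probability and the 0/1 outcome, lower is better); the exact metric variants KalshiBench reports are not specified here, and the function names are illustrative only.

```python
from typing import Sequence


def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    if not 0.0 <= forecast <= 1.0:
        raise ValueError("forecast must be a probability in [0, 1]")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 (did not happen) or 1 (happened)")
    return (forecast - outcome) ** 2


def mean_brier(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Average Brier score over a set of resolved binary questions."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum(brier_score(f, o) for f, o in zip(forecasts, outcomes))
    return total / len(forecasts)


# The 70% forecast from the example, under both possible resolutions.
print(round(brier_score(0.7, 1), 2))  # candidate wins: small penalty
print(round(brier_score(0.7, 0), 2))  # candidate loses: larger penalty
```

A forecast of 0.7 costs about 0.09 if the candidate wins and about 0.49 if the candidate loses, while an uninformative 0.5 forecast costs 0.25 either way.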

Heard on the show

“The big LLM forecasting benchmarks — ForecastBench, KalshiBench, others — are built around questions that are naturally binary.”

Episode 069 — When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Mentioned in 1 episode

069
When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Related terms

commitment failure