Consistency Checks for Language Model Forecasters
Published at the International Conference on Learning Representations (ICLR), 2025
This paper tackles the challenge of evaluating LLM forecasters, especially for long-range predictions whose ground truth is years away. While recent work shows LLMs are approaching human-level forecasting, evaluating them remains a key bottleneck. We propose evaluating forecasters not on accuracy (which we can’t yet know) but on their logical consistency. Prior work on consistency was largely limited to simple checks with ad-hoc metrics that are difficult to compare or aggregate.
We introduce a principled new consistency metric based on arbitrage: if a forecaster is logically inconsistent (e.g., it assigns 60% probability to each of two mutually exclusive events), an arbitrageur betting against it can lock in a guaranteed profit. We develop an automated pipeline that generates tuples of logically related questions (covering negation, conditional probability, and more) and measures this “arbitrage violation”. We show that these instantaneous consistency scores strongly correlate with the forecaster’s true future accuracy (Brier score). We also introduce ArbitrageForecaster, a method for improving consistency, and release a long-horizon benchmark of questions that resolve in 2028.
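To make the arbitrage intuition concrete, here is a minimal sketch (function names and the exact violation formula are illustrative, not the paper’s definitions) under the simplifying assumption that the forecaster trades unit contracts priced at its reported probabilities: a contract on event E costs P(E) and pays 1 if E occurs, so any logically inconsistent set of prices admits a riskless trade.

```python
def negation_violation(p_event: float, p_negation: float) -> float:
    """Negation check: P(A) + P(not A) should equal 1.

    If the prices sum above 1, sell one contract on each side: collect
    p_event + p_negation now and pay out exactly 1 later (exactly one of
    the two events occurs). If they sum below 1, buy both instead. Either
    way the riskless profit is |P(A) + P(not A) - 1|.
    """
    return abs(p_event + p_negation - 1.0)


def exclusive_events_violation(probs: list[float]) -> float:
    """Mutual-exclusivity check for events of which at most one can occur.

    Selling one contract on each event collects sum(probs) and pays out at
    most 1, so any excess over 1 is a guaranteed profit for the arbitrageur.
    """
    return max(0.0, sum(probs) - 1.0)


if __name__ == "__main__":
    # The example from the text: two mutually exclusive events both at 60%.
    print(exclusive_events_violation([0.6, 0.6]))  # 0.2 guaranteed profit per unit bet
    # A perfectly consistent forecaster incurs zero violation.
    print(negation_violation(0.7, 0.3))            # 0.0
```

Expressing each check as a guaranteed profit per unit bet is what makes violations of different logical types (negation, exclusivity, conditionals, etc.) comparable and aggregable into a single score, which ad-hoc per-check metrics do not allow.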
Additional information can be found in this Twitter thread.
