Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
Daniel Ranard

TL;DR
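This paper introduces a self-supervised benchmark in which a model forecasts the hidden continuation of mathematical text and a separate scorer measures how much that forecast raises the likelihood of the true continuation, with controls that test whether the gain reflects real information or a shortcut.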
Contribution
It presents a novel, label-free benchmark for assessing models' reasoning capabilities and shortcut vulnerabilities in mathematical text prediction tasks.
Findings
GPT-5.5 outperforms other models in likelihood scoring.
Likelihood scores distinguish model families and reasoning effort.
Fine-tuned scorers help identify shortcut vulnerabilities.
Abstract
We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context and a hidden continuation; the evaluated model writes an auxiliary forecast string, and a separate scorer assigns next-token probability to the continuation both with and without conditioning on the forecast. This gives a label-free test of whether the forecast transmits information about the continuation, compared against controls where the conditioning string is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138…
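The scoring idea in the abstract can be sketched as a difference of log-likelihoods: the scorer's log-probability of the hidden continuation given context plus forecast, minus the same quantity given context alone. The sketch below is illustrative only. The names `toy_logprob` and `forecast_gain` are hypothetical, the scorer is a toy add-one-smoothed unigram model standing in for a real language model, and the paper's exact metric is not given in this excerpt.

```python
import math
from collections import Counter
from typing import Callable, Sequence

# Hypothetical stand-in for a real scorer: an add-one-smoothed unigram
# model fit on the conditioning tokens. A real benchmark would use a
# language model's next-token probabilities instead.
def toy_logprob(continuation: Sequence[str],
                conditioning: Sequence[str],
                vocab_size: int = 1000) -> float:
    counts = Counter(conditioning)
    total = len(conditioning) + vocab_size
    return sum(math.log((counts[tok] + 1) / total) for tok in continuation)

# Assumed form of the score: log p(continuation | context, forecast)
# minus log p(continuation | context). Positive values mean the extra
# string made the true continuation more likely under the scorer.
def forecast_gain(context: Sequence[str],
                  forecast: Sequence[str],
                  continuation: Sequence[str],
                  logprob: Callable[..., float] = toy_logprob) -> float:
    with_forecast = logprob(continuation, list(context) + list(forecast))
    without_forecast = logprob(continuation, list(context))
    return with_forecast - without_forecast

context = ["E", "="]
continuation = ["m", "c", "^2"]

# A forecast that anticipates the hidden suffix.
gain_good = forecast_gain(context, ["m", "c", "^2"], continuation)

# Control: the extra string is just recent context repeated, not a forecast.
gain_control = forecast_gain(context, context, continuation)
```

In this toy setting the informative forecast yields a positive gain while the repeated-context control does not, which is the comparison the abstract describes. Because many suffixes are roughly equivalent, such gains would be averaged over many items rather than interpreted one at a time.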
