Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

Daniel Ranard

arXiv:2605.10810·cs.LG·May 18, 2026

TL;DR

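This paper introduces a self-supervised benchmark in which a model writes a forecast string for a hidden continuation of mathematical text, and a separate scorer measures whether that forecast transmits information about the true continuation. The benchmark also includes tests for shortcut vulnerabilities.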

Contribution

It presents a novel, label-free benchmark for assessing models' reasoning capabilities and shortcut vulnerabilities in mathematical text prediction tasks.

Findings

01 GPT-5.5 outperforms other models in likelihood scoring.

02 Likelihood scores distinguish model families and reasoning effort.

03 Fine-tuned scorers help identify shortcut vulnerabilities.

Abstract

We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context $X$ and a hidden continuation $Y$; the evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation, compared against controls where $Z$ is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138…
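
The with/without-$Z$ comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the scorer is a toy add-one-smoothed unigram model rather than a language model, the names `toy_log_prob` and `forecast_gain` are hypothetical, and reporting the result as a log-probability difference $\log p(Y \mid X, Z) - \log p(Y \mid X)$ is one natural reading of the setup, not necessarily the paper's exact metric.

```python
import math
from collections import Counter


def toy_log_prob(target_tokens, context_tokens, vocab_size=1000):
    """Toy stand-in for a scorer (assumption, not the paper's scorer).

    Fits an add-one-smoothed unigram model on the context tokens and
    returns the total log-probability (in nats) of the target tokens.
    """
    counts = Counter(context_tokens)
    total = len(context_tokens)
    return sum(
        math.log((counts[tok] + 1) / (total + vocab_size))
        for tok in target_tokens
    )


def forecast_gain(x, y, z, vocab_size=1000):
    """Hypothetical helper: log p(Y | X, Z) - log p(Y | X) under the toy scorer.

    A positive value means conditioning on Z made Y more likely.
    """
    x_tok, y_tok, z_tok = x.split(), y.split(), z.split()
    with_z = toy_log_prob(y_tok, x_tok + z_tok, vocab_size)
    without_z = toy_log_prob(y_tok, x_tok, vocab_size)
    return with_z - without_z


if __name__ == "__main__":
    context = "E = m"          # visible context X (illustrative)
    continuation = "c ^ 2"     # hidden continuation Y (illustrative)

    # Z written as a forecast of the continuation.
    print("forecast gain:", forecast_gain(context, continuation, "c ^ 2"))
    # Control: Z is just recent context, not a forecast.
    print("control gain: ", forecast_gain(context, continuation, "E = m"))
```

In this toy setting the forecast that overlaps with $Y$ yields a positive gain, while the control that merely repeats recent context does not. A real scorer would replace `toy_log_prob` with summed next-token log-probabilities from a language model.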
