Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Pranjal Aggarwal; Marjan Ghazvininejad; Seungone Kim; Ilia Kulikov; Jack Lanchantin; Xian Li; Tianjian Li; Bo Liu; Graham Neubig; Anaelia Ovalle; Swarnadeep Saha; Sainbayar Sukhbaatar; Sean Welleck; Jason Weston; Chenxi Whitehouse; Adina Williams; Jing Xu; Ping Yu; Weizhe Yuan; Jingyu Zhang; Wenting Zhao

arXiv:2603.18886 · cs.AI · March 20, 2026

Open Access

TL;DR

This paper introduces new training data, evaluation benchmarks, and methods for improving language models' ability to reason over mathematical objects, demonstrating enhanced performance and generalization across formats.

Contribution

The paper releases the Principia suite of training data and benchmarks, develops on-policy judge training techniques, and shows how on-policy training can be used to scale test-time compute via aggregation.

Findings

01. Strong LLMs struggle on Principia benchmarks

02. Training recipes significantly improve reasoning performance

03. On-policy training enhances cross-format reasoning generalization

Abstract

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs…

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Topic Modeling · Machine Learning in Materials Science