Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Rihui Xin; Han Liu; Zecheng Wang; Yupeng Zhang; Dianbo Sui; Xiaolin Hu; Bingning Wang

arXiv:2505.19439 · cs.CL · February 2, 2026


TL;DR

This paper demonstrates that simple surrogate signals like format and length can effectively guide reinforcement learning in mathematical problem solving with LLMs, reducing reliance on costly ground truth answers.

Contribution

It introduces a format-length surrogate signal approach for RL training that can match or surpass ground-truth-based optimization in mathematical reasoning tasks.

Findings

01. Early training is dominated by format learning.
02. Length-based rewards improve output quality.
03. Method achieves 40.0% accuracy on AIME2024 with a 7B model.

Abstract

Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning (RL) playing a key role in adapting them to specific applications. In mathematical problem solving, however, the reliance on ground truth answers poses significant challenges due to their high collection cost and limited availability. This work explores the use of simple surrogate signals, format and length, to guide RL training. We find that early training is dominated by format learning, where structural feedback alone accounts for most performance gains. Incorporating length-based rewards further refines outputs by discouraging overly long or short responses, enabling a GRPO approach with format-length signals to approximate, and in some cases surpass, ground-truth-based optimization. For example, our method achieves 40.0% accuracy on AIME2024 with a 7B…
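To make the idea in the abstract concrete, here is a minimal sketch of a label-free reward that combines a format check with a length score, followed by the group-normalized advantage that GRPO computes over a batch of sampled responses. The abstract does not give the paper's exact reward formulas, so everything specific here is an assumption made for illustration: the `\boxed{...}` format rule, the piecewise-linear length curve, the `target` and `max_len` values, the 0.5/0.5 weighting, and the whitespace split used in place of a real tokenizer.

```python
import re
from statistics import mean, pstdev

# Assumption: a well-formed response contains exactly one \boxed{...} answer.
BOXED = re.compile(r"\\boxed\{[^{}]+\}")


def format_reward(response: str) -> float:
    # 1.0 when exactly one boxed answer is present, otherwise 0.0.
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0


def length_reward(n_tokens: int, target: int = 512, max_len: int = 2048) -> float:
    # Hypothetical piecewise-linear score in [0, 1] that peaks at `target`
    # and falls to 0 for empty outputs and for outputs at or past `max_len`.
    if n_tokens <= 0 or n_tokens >= max_len:
        return 0.0
    if n_tokens <= target:
        return n_tokens / target
    return (max_len - n_tokens) / (max_len - target)


def surrogate_reward(response: str) -> float:
    # Format acts as a gate: malformed responses earn nothing.
    if format_reward(response) == 0.0:
        return 0.0
    n_tokens = len(response.split())  # stand-in for a real tokenizer
    # Illustrative weighting: half for format, half for length.
    return 0.5 + 0.5 * length_reward(n_tokens)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # GRPO-style normalization: center and scale rewards within one group
    # of responses sampled for the same prompt (population std assumed).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "Let x = 2, so x + 2 = 4. The answer is \\boxed{4}.",
        "The answer is 4.",
        "\\boxed{4} or maybe \\boxed{5}",
    ]
    rewards = [surrogate_reward(r) for r in group]
    print(rewards)
    print(group_advantages(rewards))
```

Neither reward ever inspects the ground truth answer; only the first response in the example group passes the format gate, so it receives the only positive advantage. Gating the length term on format is one design choice among several; an additive combination would also fit the abstract's description.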

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is generally well written, and the experimental details are clear.
- Results span multiple datasets, with care taken regarding decontamination, and cover several model sizes and families.
- The authors provide additional qualitative analysis (response-length trends, reflective-word frequencies, etc.) and ablations (e.g., RL vs. SFT for format learning) that align with their findings.

Weaknesses

- My main concern is the novelty of the work. The authors' central claim, that simple structure-based signals (format, length) can substitute for correctness in GRPO, is preceded by several prior works that study RLVR without external supervision rewards [1,2] and that examine whether RL elicits capabilities beyond what the base model can achieve [3,4,5]. These are not cited in this work (the examples above are not exhaustive; the authors do cite [6]).
- In particular, [2] shows

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The main strengths of this paper:
- The losses proposed by the authors do not require ground truth labels and are simple and interpretable.
- The proposed method achieves a large boost quite quickly in training, showing that a lot of performance can be extracted by supervising the formatting and length.

Weaknesses

The main weaknesses of this paper:
- In general, I am not sure that supervising only formatting and length would be enough for more complicated math proofs or code and, more importantly, for much longer traces.
- I am not sure exactly what the main message of this paper is meant to be. The authors' experiments show that rewarding for correctness achieves at least the same performance as rewarding for formatting and length on these tasks (for example, format-only and format-length in Table 2 are much worse

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper provides a concrete, clear, reproducible reward design.
2. The paper empirically shows a two-phase pattern: (1) early gains from learning the output format, (2) followed by further gains from length control.
3. The paper is generally lean in presentation and very easy to follow (e.g., Qwen-Math prompt format, format checker, GRPO setup), and the formulas are well organized.

Weaknesses

1. Current math-QA benchmarks already provide relatively rich verifiable signals; harder settings (e.g., proof generation, open-ended reasoning) remain underexplored. I hope the authors can extend the experiments to these tasks.
2. Results look Qwen-centric, with unclear generalization to Qwen3/Llama/Mistral. In fact, several papers show that the Qwen2.5 series of models can be successfully "RL"ed even with spurious rewards. I would really like to see results on more models.
3. Wil

Code & Models

Repositories

insightllm/rl-without-gt · PyTorch · Official


Taxonomy

Topics: Machine Learning and Data Classification · Neural Networks and Applications · Imbalanced Data Classification Techniques

Methods: Balanced Selection