Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation

Chenyang An; Shima Imani; Feng Yao; Chengyu Dong; Ali Abbasi; Harsh Shrivastava; Samuel Buss; Jingbo Shang; Gayathri Mahalingam; Pramod Sharma; Maurice Diesendruck

arXiv:2411.00863·cs.CL·July 4, 2025

Open Access

TL;DR

This paper shows that large language models trained for proof generation perform better when each training proof is written in an intuitively sequential order: proof success rates rise by up to 11%, which highlights the importance of data ordering for learning efficiency.

Contribution

The paper introduces the concept of an intuitively sequential order for proof training data and demonstrates its effectiveness through experiments on logic and math proofs.

Findings

01

Training with sequential proof order improves success rates by up to 11%.

02

Suboptimal proof ordering is common in existing datasets, affecting model performance.

03

A significant portion of advanced math proofs suffer from order issues, impacting learning.

Abstract

In the field of large language model (LLM)-based proof generation, despite extensive training on large datasets such as ArXiv, LLMs still exhibit only modest performance on proving tasks of moderate difficulty. We believe that this is partly due to the widespread presence of suboptimal ordering within the data for each proof used in training. For example, published proofs often follow a purely logical order, where each step logically proceeds from the previous steps based on the deductive rules. This order is designed to facilitate the verification of the proof's soundness, rather than to help people and models learn the discovery process of the proof. In proof generation, we argue that the optimal order for one training data sample occurs when the relevant intermediate supervision for a particular proof step in the proof is always positioned to the left of that proof step. We call such…
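The ordering criterion described in the abstract (the relevant intermediate supervision for a proof step always sits to the left of that step) can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the authors' implementation: the step names, the `supports` mapping, and both helper functions are hypothetical, and a proof is modeled simply as a list of step labels.

```python
from typing import Dict, List, Optional, Set, Tuple


def first_order_violation(
    order: List[str], supports: Dict[str, Set[str]]
) -> Optional[Tuple[str, str]]:
    """Return (step, missing_support) for the first step whose supporting
    material has not yet appeared to its left, or None if there is none.

    `supports` is a hypothetical mapping from a step to the steps that
    act as its intermediate supervision.
    """
    seen: Set[str] = set()
    for step in order:
        # Sort so the reported violation is deterministic.
        for needed in sorted(supports.get(step, set())):
            if needed not in seen:
                return (step, needed)
        seen.add(step)
    return None


def is_left_ordered(order: List[str], supports: Dict[str, Set[str]]) -> bool:
    """True when every step's supporting material precedes it."""
    return first_order_violation(order, supports) is None


# Hypothetical toy proof: a witness is stated and verified first, and the
# reasoning that explains how it was found only appears afterwards.
supports = {
    "state_witness": {"derive_witness"},
    "verify_witness": {"state_witness"},
}
published_order = ["state_witness", "verify_witness", "derive_witness"]
sequential_order = ["derive_witness", "state_witness", "verify_witness"]

print(first_order_violation(published_order, supports))
print(is_left_ordered(sequential_order, supports))
```

In this toy case the published order is easy to verify but hides the discovery step from a left-to-right learner, while the reordered version places the derivation before the step it explains.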

Taxonomy

Topics: Mathematics, Computing, and Information Processing