Reinforced Context Order Recovery for Adaptive Reasoning and Planning
Long Ma, Fangwei Zhong, Yizhou Wang

TL;DR
This paper introduces ReCOR, a reinforcement learning framework that adaptively determines token generation order in language models, improving reasoning and planning tasks beyond fixed-order approaches.
Contribution
ReCOR is a novel reinforcement learning method that extracts data-dependent token orders without annotations, enhancing model performance on complex reasoning tasks.
Findings
ReCOR outperforms baseline models on reasoning and planning datasets.
ReCOR sometimes surpasses oracle models with ground-truth order.
Adaptive token ordering makes complex reasoning and planning tasks more tractable for the model.
Abstract
Modern causal language models, followed by rapid developments in discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that current causal and diffusion models encounter difficulties in problems that require adaptive token generation orders to solve tractably, which we characterize with the V-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework to extract adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of…
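To make the idea of a data-dependent generation order concrete, here is a minimal toy sketch. It is not ReCOR itself: the abstract describes a learned, reinforcement-learning-based ordering, whereas this sketch swaps in a simple greedy heuristic that always fills the position whose predictive distribution has the lowest entropy. The names `easiest_first_decode` and `toy_predict`, and the three-token toy task, are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Dist = List[float]
Tokens = List[Optional[int]]


def entropy(probs: Dist) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def easiest_first_decode(
    length: int, predict: Callable[[Tokens], Dict[int, Dist]]
) -> Tuple[Tokens, List[int]]:
    """Greedy stand-in for an adaptive order (an assumption, not the paper's policy):
    repeatedly fill the unfilled position with the lowest predictive entropy."""
    tokens: Tokens = [None] * length
    order: List[int] = []
    while any(t is None for t in tokens):
        dists = predict(tokens)  # position -> distribution, unfilled positions only
        pos = min(dists, key=lambda i: entropy(dists[i]))
        probs = dists[pos]
        tokens[pos] = max(range(len(probs)), key=probs.__getitem__)
        order.append(pos)
    return tokens, order


def toy_predict(tokens: Tokens) -> Dict[int, Dist]:
    """Hypothetical toy task over binary tokens: position 2 is always 1,
    position 1 copies position 2, and position 0 negates position 1.
    Positions whose dependency is still unknown are predicted uniformly."""
    uniform = [0.5, 0.5]

    def one_hot(v: int) -> Dist:
        return [1.0 if k == v else 0.0 for k in range(2)]

    dists: Dict[int, Dist] = {}
    for i, t in enumerate(tokens):
        if t is not None:
            continue
        if i == 2:
            dists[i] = one_hot(1)
        elif i == 1:
            dists[i] = uniform if tokens[2] is None else one_hot(tokens[2])
        else:
            dists[i] = uniform if tokens[1] is None else one_hot(1 - tokens[1])
    return dists


tokens, order = easiest_first_decode(3, toy_predict)
print(tokens, order)
```

In this toy task a left-to-right decoder would have to guess position 0 from a uniform distribution, while the easiest-first order resolves every position with certainty by starting from the right.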
Taxonomy
Topics: Context-Aware Activity Recognition Systems · AI-based Problem Solving and Planning · Reinforcement Learning in Robotics
