TL;DR
MOTIF is a reinforcement learning fine-tuning approach that lets large language models reason modularly over multiple rounds, going beyond their context size limit while improving accuracy with fewer training samples.
Contribution
The paper presents MOTIF, a novel RL fine-tuning method that enhances LLM reasoning by enabling multi-round thinking over larger contexts, with improved accuracy and sample efficiency.
Findings
Improved accuracy by 3.8% and 3.3% over vanilla GRPO-based training on the MATH500 and AIME2024 benchmarks, respectively.
Demonstrated effective reasoning beyond context size limits.
Achieved these gains using only 15% of the samples, demonstrating sample efficiency.
Abstract
Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing the group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite number of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF, an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We…
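The multi-round idea described in the abstract can be pictured as an inference loop in which each round starts from a fresh context and receives only the question plus a short summary of earlier progress. The following is a minimal illustrative sketch, not the authors' implementation: the names `multi_round_think` and `toy_generate`, the `SUMMARY:` and `ANSWER:` tags, and the summary-passing scheme itself are all assumptions made for this example.

```python
from typing import Callable, Optional


def multi_round_think(
    question: str,
    generate: Callable[[str], str],  # hypothetical: any prompt -> text function
    max_rounds: int = 3,
    answer_tag: str = "ANSWER:",     # assumed output convention
    summary_tag: str = "SUMMARY:",   # assumed output convention
) -> Optional[str]:
    """Run up to max_rounds generations, carrying only a short summary forward.

    Each round sees a new prompt, so the total number of thinking tokens
    across rounds can exceed what fits in a single context window.
    """
    summary = ""
    for round_idx in range(1, max_rounds + 1):
        prompt = (
            f"Question: {question}\n"
            f"Progress so far: {summary or 'none'}\n"
            f"Round {round_idx} of {max_rounds}."
        )
        output = generate(prompt)
        if answer_tag in output:
            # The model committed to a final answer in this round.
            return output.split(answer_tag, 1)[1].strip()
        if summary_tag in output:
            # Keep only the summary; the rest of the round's tokens are dropped.
            summary = output.split(summary_tag, 1)[1].strip()
        else:
            # Crude fallback: keep the tail of the output as the summary.
            summary = output[-200:]
    return None  # no answer produced within the round budget


def toy_generate(prompt: str) -> str:
    """Stand-in for an LLM call: summarizes in round one, answers afterwards."""
    if "Progress so far: none" in prompt:
        return "thinking... SUMMARY: 2+3=5"
    return "using summary... ANSWER: 10"


if __name__ == "__main__":
    print(multi_round_think("What is (2+3)*2?", toy_generate))
```

In an RL setting such as the one the abstract describes, a reward would presumably be computed from the final answer of the last round; how that reward is assigned across rounds is not specified in the text above, so it is left out of this sketch.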
