BOW: Reinforcement Learning for Bottlenecked Next Word Prediction
Ming Shen, Zhikun Xu, Jacob Dineen, Xiao Ye, Ben Zhou

TL;DR
This paper introduces BOW, an RL-based method for next-word prediction that incorporates an intermediate reasoning step, improving models' explicit reasoning and overall performance on various benchmarks.
Contribution
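BOW is a novel RL formulation of next-word prediction (NWP) that inserts an intermediate reasoning bottleneck: the policy must generate a reasoning trajectory before the next word is predicted, with the aim of strengthening reasoning relative to standard NWP training.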
Findings
BOW improves zero-shot reasoning by nearly 5% on average.
BOW outperforms both RL with binary rewards and supervised finetuning on 7 of 10 benchmarks.
BOW induces explicit reasoning, strengthening general reasoning ability.
Abstract
Large language models (LLMs) are typically pretrained with next-word prediction (NWP), which yields strong surface fluency but places limited pressure on models to form explicit reasoning before emitting tokens. We study whether shifting the supervision signal can better elicit explicit reasoning and, more broadly, strengthen models' general reasoning capability. We present BOttlenecked next-Word prediction (BOW), an RL formulation of NWP that inserts an intermediate reasoning bottleneck. Instead of predicting the next word directly from context, the policy model must first generate a next-word reasoning trajectory. To guide the RL optimization, a frozen scorer then assigns this trajectory a soft, distributional reward equal to the probability of the gold next token conditioned solely on the trajectory. We also propose an optional L1-style regularizer on the reward to discourage…
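The soft reward described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the frozen scorer is replaced by a plain list of logits that it is assumed to produce after reading only the reasoning trajectory, and the names `soft_reward` and `binary_reward` are hypothetical. The binary baseline shown for contrast (1 if the scorer's top prediction is the gold token, else 0) is likewise an assumption about what a binary reward would look like. The optional regularizer is omitted because its description is truncated above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_reward(scorer_logits, gold_index):
    """Soft, distributional reward: the probability the (assumed) frozen
    scorer assigns to the gold next token, given only the trajectory."""
    return softmax(scorer_logits)[gold_index]


def binary_reward(scorer_logits, gold_index):
    """Assumed binary baseline: 1.0 only if the scorer's most likely
    token is the gold token, otherwise 0.0."""
    top_index = max(range(len(scorer_logits)), key=scorer_logits.__getitem__)
    return 1.0 if top_index == gold_index else 0.0
```

With scorer logits `[2.0, 1.0, 0.0]` and the gold token at index 1, the binary reward is 0.0 while the soft reward is about 0.24, so a trajectory that makes the gold token plausible but not top-ranked still receives a graded signal.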
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling
