Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

TL;DR
This paper reveals the internal policy structures of large language models, showing how internal layer optimization can improve reasoning capabilities, and introduces a bottom-up RL method called BuPO that enhances model reasoning through internal layer training.
Contribution
The paper introduces a novel decomposition of LLM policies into internal layer policies and proposes BuPO, a bottom-up RL approach that optimizes internal layers for better reasoning.
Findings
Internal policies evolve from exploration to refinement across layers.
Qwen exhibits human-like reasoning structure, unlike Llama.
BuPO improves reasoning performance on benchmarks.
Abstract
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via Transformer's residual stream. Our entropy analysis on internal policy reveals distinct patterns: (1) universally, policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers; and (2) Qwen exhibits a progressive, human-like reasoning structure, contrasting with the abrupt final-layer convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
