Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Yuqiao Tan; Minzheng Wang; Shizhu He; Huanxuan Liao; Chengfeng Zhao; Qiunan Lu; Tian Liang; Jun Zhao; Kang Liu

arXiv:2512.19673·cs.LG·February 3, 2026

TL;DR

This paper shows that an LLM policy can be decomposed into internal layer policies, finds that optimizing internal layers can improve reasoning, and introduces BuPO, a bottom-up RL method that trains internal layers to enhance model reasoning.

Contribution

The paper introduces a novel decomposition of LLM policies into internal layer policies and proposes BuPO, a bottom-up RL approach that optimizes internal layers for better reasoning.

Findings

01

Internal policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers.

02

Qwen exhibits a progressive, human-like reasoning structure, whereas Llama converges abruptly in the final layer.

03

BuPO improves reasoning performance on benchmarks.

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream. Our entropy analysis of internal policies reveals distinct patterns: (1) universally, policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers; and (2) Qwen exhibits a progressive, human-like reasoning structure, contrasting with the abrupt final-layer convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's…
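The entropy analysis described in the abstract can be illustrated with a small toy sketch. It is not the paper's implementation: it assumes that an internal layer policy is obtained by projecting a layer's residual-stream state through the unembedding matrix and applying a softmax (a logit-lens-style reading), and it omits details a real model would need, such as the final layer norm. The names `internal_layer_entropies`, `hidden_states`, and `unembed` are hypothetical, and the data is synthetic.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) over the last axis."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)


def internal_layer_entropies(hidden_states, unembed):
    """Entropy of an assumed per-layer policy.

    hidden_states: (num_layers, d_model) residual-stream state per layer
    unembed:       (d_model, vocab_size) unembedding matrix
    Assumption: layer policy = softmax(hidden_state @ unembed).
    """
    logits = hidden_states @ unembed
    return entropy(softmax(logits))


# Synthetic data: one fixed direction whose norm grows with depth,
# mimicking a distribution that sharpens from early to top layers.
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 16
unembed = rng.normal(size=(d_model, vocab_size))
direction = rng.normal(size=d_model)
hidden_states = np.stack([s * direction for s in (0.0, 0.5, 2.0, 8.0)])

layer_entropies = internal_layer_entropies(hidden_states, unembed)
for layer, value in enumerate(layer_entropies):
    print(f"layer {layer}: entropy = {value:.4f} nats")
```

In this toy setup the first layer is uniform over the vocabulary (maximum entropy) and entropy falls with depth, which is the qualitative pattern the abstract reports. With a real model, the per-layer states would come from the model's intermediate hidden states rather than synthetic vectors.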

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning