Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning

Haoran Dang; Cuiling Lan; Hai Wan; Xibin Zhao; Yan Lu

arXiv:2602.11779·cs.LG·February 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces TAMPO, a novel framework that learns to adapt the temperature hyperparameter in LLM reinforcement learning dynamically, improving exploration and policy performance without extra rollouts.

Contribution

The paper presents TAMPO, a hierarchical meta-policy approach that adaptively controls temperature in LLM RL, outperforming fixed or heuristic schedules on reasoning benchmarks.

Findings

01 · TAMPO outperforms fixed temperature baselines on five reasoning tasks.

02 · Adaptive temperature control improves policy learning efficiency.

03 · The meta-policy effectively learns temperature schedules without additional rollouts.

Abstract

Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures produce focused outputs but may cause premature convergence. Yet static or heuristic temperature schedules fail to adapt to the dynamic demands of reinforcement learning (RL) throughout training, often limiting policy improvement. We propose Temperature Adaptive Meta Policy Optimization (TAMPO), a new framework that recasts temperature control as a learnable meta-policy. TAMPO operates through a hierarchical two-loop process. In the inner loop, the LLM policy is updated (e.g., using GRPO) with trajectories sampled at the temperature selected by the meta-policy. In the outer loop, the meta-policy updates the distribution over candidate temperatures by…
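The trade-off described in the abstract can be made concrete with a small sketch of temperature-scaled softmax. This is a generic illustration and not code from the paper: the logits are hypothetical, and the helper names `softmax_with_temperature` and `entropy` are introduced here for illustration only. It shows that, for the same logits, a higher temperature yields a flatter (higher-entropy) next-token distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token logits over a three-token vocabulary.
logits = [2.0, 1.0, 0.1]

# Entropy of the sampling distribution at a low, neutral, and high temperature.
entropies = {
    t: entropy(softmax_with_temperature(logits, t))
    for t in (0.6, 1.0, 1.5)
}

for t, h in entropies.items():
    print(f"T={t}: entropy={h:.3f} nats")
```

Low temperatures concentrate probability on the top-scoring token (focused outputs), while high temperatures spread it across alternatives (diverse but noisier outputs).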

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The technical approach is sound and well-motivated. The paper identifies a clear limitation of existing LLM RL methods – the use of fixed or manually tuned exploration temperature – and offers a principled solution.
2. A significant practical strength of TAMPO is its decoupled outer-loop update mechanism, which enables online adaptation without additional rollouts.

Weaknesses

1. Failure to Contextualize within the Meta-Gradient Literature: This omission is compounded by the paper's flawed positioning within the meta-learning literature it does cite. The paper does cite "meta-gradient methods" (Xu et al., 2018) for learning other hyperparameters like $\gamma$ and $\lambda$. It fails to cite the direct combination of these two concepts: "Meta-SAC: Auto-tune the entropy temperature of soft actor-critic via metagradient" (Wang & Ni, 2020), which applies a meta-gradient ap

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- S1. [Presentation] First of all, this paper is well written and organized.
- S2. [Novelty] The basic idea of learning LLM temperature for RL-based LLM post-training (i.e., meta-policy learning) seems novel.

Weaknesses

- W1. [Performance] For meta-policy learning, TAMPO additionally calculates likelihoods at virtual temperatures when training the policy model. This increases the computational complexity of the GRPO-based LLM post-training. However, compared to the basic GRPO-based post-training (pass@1: 42.6%), TAMPO provides a slightly higher average accuracy (pass@1: 44.5%).
- W2. [Hyper-parameters] Even though this paper proposes to learn the temperature meta-policy, this may introduce additional hyper-pa
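To make W1's cost concern concrete, the following is a minimal sketch of what "calculating likelihoods at virtual temperatures" could look like: re-scoring one already-sampled token sequence under several temperatures from the same per-step logits. This is an assumption-laden illustration, not the paper's implementation. The logits, the token ids, the temperature subset, and the helper names `log_softmax` and `sequence_log_likelihood` are all hypothetical, and how TAMPO actually uses such likelihoods is not specified in this excerpt.

```python
import math

def log_softmax(logits, temperature):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def sequence_log_likelihood(step_logits, token_ids, temperature):
    """Sum of per-token log-probabilities of a fixed sequence at one temperature."""
    return sum(
        log_softmax(logits, temperature)[tok]
        for logits, tok in zip(step_logits, token_ids)
    )

# Hypothetical per-step logits for a two-token rollout over a 3-token vocabulary.
step_logits = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
token_ids = [0, 1]  # the sampled tokens (here, the argmax at each step)

# Hypothetical subset of candidate ("virtual") temperatures to re-score under.
virtual_temperatures = [0.6, 1.0, 1.4]

log_liks = {
    t: sequence_log_likelihood(step_logits, token_ids, t)
    for t in virtual_temperatures
}

for t, ll in log_liks.items():
    print(f"T={t}: log-likelihood={ll:.3f}")
```

In this sketch the extra work is one log-softmax per step per candidate temperature over logits that are already available, so the overhead grows linearly with the number of candidate temperatures.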

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- This work provides an interesting way to control exploration in LLM RL by recasting temperature as a meta-policy variable.
- TAMPO efficiently reuses existing rollouts for meta-policy updates, introducing negligible additional cost.
- TAMPO demonstrates consistent improvements across multiple reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B.

Weaknesses

- Experiments are restricted to mathematical reasoning tasks; generalization to other domains (dialogue, code generation, ...) remains untested.
- It’s unclear how TAMPO can be applied to critic-based or hybrid RLHF methods.
- The approach uses a fixed discrete set of temperatures {0.6, 0.7, ..., 1.4, 1.5}. Continuous temperature optimization might yield smoother adaptation.
- The paper does not include an ablation study for different base models.
- While results are strong, interpretabilit
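The discrete temperature set mentioned in this review can be pictured as the action space of a categorical meta-policy. The sketch below is illustrative only: the preference scores, the seed, and the names `candidate_temperatures`, `preferences`, and `sample_temperature` are hypothetical, and the rule by which TAMPO updates the distribution is not given in this excerpt, so no update step is shown.

```python
import math
import random

# The ten candidate temperatures quoted in the review: 0.6, 0.7, ..., 1.5.
candidate_temperatures = [round(0.6 + 0.1 * i, 1) for i in range(10)]

# Hypothetical unnormalized preference scores, one per candidate.
# All zeros corresponds to a uniform distribution over candidates.
preferences = [0.0] * len(candidate_temperatures)

def categorical_probs(scores):
    """Softmax over preference scores, giving a distribution over candidates."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_temperature(rng, temperatures, scores):
    """Draw one sampling temperature from the categorical meta-policy."""
    probs = categorical_probs(scores)
    return rng.choices(temperatures, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
meta_probs = categorical_probs(preferences)
chosen = sample_temperature(rng, candidate_temperatures, preferences)
print(f"sampled temperature: {chosen}")
```

A continuous alternative, as the reviewer suggests, would replace the categorical distribution with a density over a temperature interval; that variant is not sketched here.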

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning and Data Classification