UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach

TL;DR
UpSkill introduces a novel training method for large language models that enhances response diversity and correctness across multiple attempts by optimizing mutual information, leading to improved performance on mathematical reasoning tasks.
Contribution
The paper proposes UpSkill, a new training approach that applies mutual information-based rewards within reinforcement learning to improve multi-attempt correctness in LLMs.
Findings
UpSkill improves pass@k metrics by approximately 3% on multiple models.
Mutual information optimization is empirically and theoretically linked to performance gains.
The method enhances response diversity without degrading single-attempt accuracy.
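For readers unfamiliar with the metric in these findings, pass@k is the probability that at least one of k attempts is correct. The sketch below uses the common unbiased estimator computed from n sampled attempts of which c are correct. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Assumption: this is the standard combinatorial estimator; the paper's
    own evaluation code may differ.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n attempts contains no correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct attempt out of five: pass@1 is 0.2, pass@5 is 1.0.
    print(pass_at_k(5, 1, 1), pass_at_k(5, 1, 5))
```

With k = 1 the estimator reduces to c / n, the single-attempt accuracy, which is why pass@1 and pass@k can move in different directions when diversity changes.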
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to the conditioning latent z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains…
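To make the idea of a token-level MI reward inside GRPO more concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation. It assumes a finite set of skills z with a uniform prior, and that the log-probability of each sampled token is available under every skill. The reward form (log-probability under the sampled skill minus the log of the skill-averaged probability) is an assumed MISL-style estimate. The names `token_mi_rewards` and `group_relative_advantages` are hypothetical.

```python
import math
from statistics import fmean, pstdev


def token_mi_rewards(logp_by_skill, z):
    """Per-token reward: log pi(y_t | x, z) - log mean_j pi(y_t | x, j).

    logp_by_skill[j][t] is the log-probability of sampled token t when the
    policy is conditioned on skill j (hypothetical input layout).
    """
    n_skills = len(logp_by_skill)
    rewards = []
    for t, logp_z in enumerate(logp_by_skill[z]):
        column = [row[t] for row in logp_by_skill]
        m = max(column)  # subtract the max for numerical stability
        log_marginal = m + math.log(
            sum(math.exp(v - m) for v in column) / n_skills
        )
        rewards.append(logp_z - log_marginal)
    return rewards


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Two skills, two tokens: only the first token distinguishes the skills.
    logp = [[math.log(0.8), math.log(0.5)],
            [math.log(0.2), math.log(0.5)]]
    print(token_mi_rewards(logp, z=0))
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

A token that is equally likely under every skill earns zero reward, while a token that is more likely under the sampled skill than on average earns a positive one. How such a term is weighted against the correctness reward is a design choice this sketch does not cover.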
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper targets a crucial problem in existing RL training: the trade-off between single-attempt accuracy and multi-attempt diversity. The authors proposed a novel adaptation of MISL, and the formulation of the token-level mutual information reward made a valuable contribution. 2. The proposed method is supported by theoretical analysis, which links the mutual information objective to a lower bound on pass@k improvement. 3. Experiments show that UpSkill is effective at improving both the sin
1. The scalability of UpSkill appears limited, as its experiments are confined to the GSM8K dataset with N=5, a relatively simple benchmark for mathematical reasoning. 2. The method's stability is questionable, as it shows inconsistent effects on the pass@1 metric across different arithmetic and GSM8K tasks. It also remains unclear whether UpSkill is effective when applied to more powerful base models.
This paper presents a comprehensive theoretical analysis aimed at establishing a connection between mutual information and bounds on the LLM's pass@k. By introducing token-level mutual information into the reward, the approach enhances the diversity of the model's output strategies, thereby increasing the pass@k success rate. The proposed idea is sufficiently novel, and the experimental results indicate that it effectively improves the model's pass@k capability.
1. The experimental results of the paper are not convincing. First, the results across the two experimental settings are inconsistent: in the first experiment, pass@k improves but pass@1 stays low, while in the second experiment, pass@k improves without degrading pass@1. This indicates that the results are highly sensitive to the environment/dataset, but the paper only tests the method on two environments/datasets, so the evidence for robustness is weak. Moreover, the experiments only
In my opinion, this paper's primary strength lies in its principled approach, tightly coupling the proposed method with a strong theoretical justification. The authors provide a formal proof (Lemma 1) that directly links the mutual information objective $\mathcal{I}(\tau;z|x)$ to a bound on pass@k improvements, moving the work beyond simple empirical heuristics. Moreover, the theoretical proof is well-supported by the empirical results. The "arithmetic environment" (Section 5.1, Fig
I think the experiments and analysis should be more thorough and solid. 1) The experimental results for the Llama 3.1-8B model are negative: pass@k performance decreased by 2% (from 88% to 86%). The lack of discussion of this contradictory finding in the main text undermines the method's claimed robustness, especially on high-capability models. 2) The presentation of the main GSM8K results is incomplete. The specific value of $k$ used for the pass@k, p
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
