Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective
Xiao Hu, Hong Xie, Tao Tan, Defu Lian, Jianyu Han

TL;DR
This paper investigates the reinforcement fine-tuning of large language models by systematically analyzing the impact of different design choices through a bottom-up experimental pipeline inspired by multi-armed bandit theory.
Contribution
It introduces a minimalist experimental framework that isolates the effects of individual fine-tuning components, providing new insights into their roles and bottlenecks.
Findings
Identifies key factors influencing fine-tuning effectiveness.
Reveals bottlenecks in current reinforcement learning approaches.
Provides theoretical and empirical guidance for better fine-tuning strategies.
Abstract
A large number of heuristics have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims are made from time to time, making this area elusive. Reflecting on this situation, two fundamental questions still lack a clear answer: 1) what is the role of each optimization choice? 2) which ones are the bottlenecks? This paper aims to shed light on both questions, and in doing so faces the challenge of several entangled confounding factors in the fine-tuning process. To tackle this challenge, we propose a bottom-up experiment pipeline. The bottom layer is a minimalist configuration: a single training example, one rollout per round, and the reward serving directly as the learning signal, with no advantage function design. This minimalist configuration connects to multi-armed bandit learning with an extremely large discrete action space, which offers theories to corroborate…
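As a rough illustration of the minimalist configuration described in the abstract, the sketch below treats one fixed prompt as a bandit whose arms stand in for candidate responses. Each round samples a single arm (one rollout) and scales the policy-gradient update by the raw reward, with no baseline or advantage function. Everything here is an assumption made for illustration and is not taken from the paper: the softmax policy over four arms, the function name `run_bandit`, the reward vector, and the learning rate. The paper's setting involves an extremely large action space, which this toy does not capture.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_bandit(arm_rewards, rounds=2000, lr=0.5, seed=0):
    """Policy-gradient learning on a toy bandit (hypothetical helper).

    arm_rewards: deterministic reward per arm (illustrative assumption).
    Each round draws ONE arm (one rollout) and uses the raw reward as the
    learning signal: no baseline and no advantage estimate.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    for _ in range(rounds):
        probs = softmax(logits)
        arm = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[arm]
        # Gradient of log pi(arm) w.r.t. the logits is onehot(arm) - probs.
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == arm else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_pi
    return softmax(logits)


if __name__ == "__main__":
    # Only arm 2 is "correct"; the other arms earn zero reward.
    print(run_bandit([0.0, 0.0, 1.0, 0.0]))
```

With a binary reward, the policy only changes on rounds where the rewarded arm happens to be sampled, so learning speed depends on how much probability the initial policy already places on that arm. This is one way to see why the size of the action space matters in the bandit view.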
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research
