MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

TL;DR
MA-RLHF introduces macro actions into reinforcement learning from human feedback, enabling higher-level language constructs to improve learning speed, stability, and performance across multiple NLP tasks without added computational costs.
Contribution
The paper presents a novel macro actions framework for RLHF that enhances credit assignment and training efficiency in large language models.
Findings
Achieves up to 30% performance improvement in text summarization and code generation.
Reaches parity with standard RLHF 1.7 to 2 times faster in training.
Demonstrates effectiveness across various NLP tasks and model sizes.
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to preferred outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at a higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without…
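The abstract's point about temporal distance can be illustrated with a small sketch. It is not the authors' implementation. It assumes a fixed n-gram grouping rule, a single reward on the last token, summed rewards within each macro action, and a discount factor below 1 chosen only to make the distance visible. All function names are hypothetical.

```python
from typing import List, Sequence


def segment_fixed_ngram(tokens: Sequence, n: int) -> List[list]:
    """Group a sequence into consecutive chunks of at most n items.

    Assumption: a fixed-length grouping rule, used here only for illustration.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return [list(tokens[i:i + n]) for i in range(0, len(tokens), n)]


def macro_rewards(token_rewards: Sequence[float], n: int) -> List[float]:
    """Sum token-level rewards inside each macro action (an assumed aggregation)."""
    return [sum(chunk) for chunk in segment_fixed_ngram(token_rewards, n)]


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Compute the discounted return-to-go at every decision step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical episode: 20 tokens, reward only on the final token.
    token_level = [0.0] * 19 + [1.0]
    gamma = 0.95  # illustrative value, not taken from the paper

    macro_level = macro_rewards(token_level, n=5)
    print("decision steps (token):", len(token_level))
    print("decision steps (macro):", len(macro_level))
    print("first-step return (token):", discounted_returns(token_level, gamma)[0])
    print("first-step return (macro):", discounted_returns(macro_level, gamma)[0])
```

With 5-token macro actions, the first decision is 3 steps from the reward instead of 19, so the learning signal that reaches it is attenuated far less. Other grouping rules would change where the chunk boundaries fall but not this basic effect.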
Peer Reviews
Decision: ICLR 2025 Poster
(S1) The direction of the current work to use macro actions (or a group of low-level actions together) is novel and interesting, for the setting of aligning a language model with human preferences (RLHF scope). Please see W1 for more discussion. (S2) The manuscript is well-written, with sufficient details on all the technical details making it easy to understand and comprehend. I thoroughly enjoyed reading the paper. (S3) The ablation and generalization studies presented in the experiment sect
(W1) The idea of gathering multiple actions together as a “macro-action” is not novel in general. Referred to as “action chunking” [A], similar ideas have been explored in other domains. As noted in (S1), this is still useful in the context of RLHF, but with a diminished novelty factor. (W2) The current work mostly experiments with a single LLM (Gemma and its variants). Given (W1) and lack of generalizability across different models, the usefulness of the current approach over PPO is not clearl
* They conduct thorough analysis on many different tasks, using several different sized base models. * They conduct many ablations of their method. * The paper is clearly written and easy to follow. * This empirical analysis of macro actions presents a useful contribution to our understanding of RLHF training.
* Some of the differences in performance could be due to tuning one method more than another. It would be great if the paper could at least document the steps they went through to tune hyper-parameters for their method and baselines. * My understanding is that the primary functional difference between per-token PPO and macro action PPO is the granularity of the value function, importance sampling, and discount factor. It is a fairly small modification, but this is not necessarily clear when read
- The paper identifies a critical limitation in token-level RLHF - the credit assignment problem over long sequences - and proposes a simple yet effective solution by incorporating macro actions. The approach is well-motivated by both theoretical considerations (credit assignment, temporal abstraction) and practical issues (subword tokenization challenges). - The paper is well-written and easy to follow. - The experimental evaluation is thorough, covering multiple tasks, model sizes (2B to 27B
- While the paper explores different termination conditions for macro actions (n-gram based, perplexity based, parsing based), there could be more analysis of how different types of macro actions affect different types of tasks. For example, which macro action strategies work best for which types of generation tasks? I would expect parsing-based termination might also work well on code generation tasks if a programming-language-based parser was used. - Lack details of Human Evaluat
Code & Models
- 🤗 ernie-research/APPS-Gemma-2B-MA-PPO-Fixed10 · model · 17 dl
- 🤗 ernie-research/APPS-Gemma-7B-MA-PPO-Fixed10 · model · 4 dl
- 🤗 ernie-research/HH-RLHF-Gemma-2B-MA-PPO-Fixed5 · model · 10 dl
- 🤗 ernie-research/HH-RLHF-Gemma-7B-MA-PPO-Fixed5 · model · 3 dl
- 🤗 ernie-research/TLDR-Gemma-2-27B-MA-PPO-Fixed5 · model · 14 dl
- 🤗 ernie-research/TLDR-Gemma-2B-MA-PPO-Fixed5 · model · 4 dl · ♡ 1
- 🤗 ernie-research/TLDR-Gemma-7B-MA-PPO-Fixed5 · model · 2 dl
Taxonomy
Topics: Neural Networks and Applications · Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications
