Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu

TL;DR
This paper introduces GRAO, a unified framework combining supervised fine-tuning and reinforcement learning to improve language model alignment, achieving significant performance gains and theoretical guarantees.
Contribution
GRAO is a novel unified approach that integrates SFT and RL for better alignment, with new strategies for sample generation, loss formulation, and parameter updates.
Findings
GRAO outperforms SFT, DPO, PPO, and GRPO baselines in alignment tasks.
Theoretical analysis confirms GRAO's convergence and sample efficiency.
Empirical results show substantial relative improvements in alignment performance.
Abstract
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by…
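The abstract names "intra-group relative advantage weighting" but, as excerpted here, does not give the formula. The sketch below is a minimal illustration, assuming the GRPO-style convention of standardizing each sampled response's reward against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` term, and the example rewards are hypothetical and are not taken from the paper; GRAO's actual loss may weight samples differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    ASSUMPTION: GRPO-style normalization, (r - group_mean) / (group_std + eps).
    This is an illustrative stand-in, not the paper's stated formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no preference signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Under this assumption, responses that score above their group's mean get a positive weight and those below get a negative one, so the signal is relative to sibling samples rather than to an absolute reward scale.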
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- Studying whether models can be trained to improve the alignment of other models is an important research question, particularly given the increasing emphasis on automated safety supervision and scalable oversight.
- The paper attempts to provide a mathematical formulation, which, if clarified and strengthened, could contribute to the emerging space of alignment studies.
- The framing (“alignment for alignment capability”) is vague and not presented in a way that builds intuition.
- It is unclear what concrete capability the paper is targeting or how success should be measured.
- The introduction does not present a crisp research question or conceptual insight.
- Contributions are not enumerated cleanly, and the storyline does not reveal what the reader should take away.
- Several equations do not follow standard conventions from related alignment and bileve
1. The methodology is illustrated clearly.
2. The convergence of the proposed optimization objective is proved theoretically.
3. The experiments cover various baselines and include more than one LLM family.
1. The hyperparameters $\beta$ and $\lambda$ are important: they balance exploration and imitation and control the regularization strength. However, beyond the default values, the authors provide no further information on them. An ablation study is needed to understand how the algorithm synergizes SFT and RL.
2. The figures for the experimental results are not fine-grained. Vector format should be used in the final version of the paper.
3. The theoretical bound in Equation 4 is not deep
- The approach shows a unique way of combining SFT and RL with imitation-learning and exploration-type losses.
- The empirical analysis is promising, showing improvement and faster convergence over baselines.
- Works well on both normal (dense) and mixture-of-experts (MoE) models.
- The key weakness is that the paper does not explain clearly where the benefit comes from. I appreciate the authors showing the ablation in Table 3 that removes several components, but it is not clear which component is key. For example, does the main contribution come from the imitation data or the dense feedback? It is also not clear why training separately is an issue and why the additive combination helps.
- How good is the performance with SFT first and then RL with varying beta? Where is the comparison of KL? Its absence makes this unclear. Since, wh
No Strengths
#### **1. Incomplete and Ambiguous Objective Formulation**
- **Equation (2)** (Line 215) defines an expectation but omits the inner function $f(X)$, making the expression incomplete.
- **Equation (3)** (Line 219) references advantage terms $\hat{A}_{o_i}$ and $\hat{A}_y$ without defining the reward function $R(o_i, y)$ (Line 250).
- Several variables remain undefined throughout the paper:
  - $o_i$ in Eq. (5)
  - $o_{\text{pre}, i}$, $o_{\text{post}, i}$ in Eq. (6)
- These omissions make
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
