Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach

Xinnan Zhang; Chenliang Li; Siliang Zeng; Jiaxiang Li; Zhongruo Wang; Kaixiang Lin; Songtao Lu; Alfredo Garcia; Mingyi Hong

arXiv:2506.17828·cs.LG·July 4, 2025

3 Reviews

TL;DR

This paper introduces IRO, a reinforcement learning framework that aligns frozen large language models with human preferences through an iterative reweighting and optimization process, avoiding direct weight updates.

Contribution

The method enables alignment of large language models without modifying their weights, using a novel iterative reweight-then-optimize approach with lightweight value functions.

Findings

01

IRO effectively aligns frozen LLMs with human preferences.

02

The approach achieves comparable performance to traditional fine-tuning methods.

03

It reduces the need for access to model weights and lowers computational costs.

Abstract

Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using…
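
As a rough illustration of the "sample, then resample" idea described above, the following is a minimal sketch and not the authors' algorithm. The resampling rule is not spelled out in the text above, so the exponential weighting `exp(score / beta)` used here is an assumption borrowed from standard KL-regularized RL. The functions `toy_base_model` and `toy_score` are hypothetical stand-ins for a frozen LLM and a learned value function.

```python
import math
import random


def reweight_and_resample(candidates, scores, beta, k, rng):
    """Resample k candidates with probability proportional to exp(score / beta).

    ASSUMPTION: exponential reweighting is a common choice in KL-regularized RL;
    it is used here for illustration and is not confirmed as the paper's rule.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp((s - top) / beta) for s in scores]
    return rng.choices(candidates, weights=weights, k=k)


def toy_base_model(rng, n):
    # Hypothetical stand-in for a frozen LLM: emits random integers as "responses".
    return [rng.randint(0, 9) for _ in range(n)]


def toy_score(candidate):
    # Hypothetical stand-in for a value or reward function: larger is better.
    return float(candidate)


rng = random.Random(0)
candidates = toy_base_model(rng, 64)           # step (i): sample from the base model
scores = [toy_score(c) for c in candidates]
resampled = reweight_and_resample(             # step (ii): resample toward high scores
    candidates, scores, beta=0.5, k=64, rng=rng
)
```

In this sketch a smaller `beta` concentrates the resampled set on the highest-scoring candidates, while a larger `beta` keeps it closer to the base model's own samples. The base model itself is never modified.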

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Introduces a novel approach for reweighting self-generated data to facilitate successive policy improvement.
2. Provides thorough theoretical analysis, considering convergence and efficiency.
3. Provides insightful ablation studies, including β selection, chunk length, data volume, and reward model quality.

Weaknesses

1. The article devotes considerable space to justifying the method but gives limited explanation of how it operates.
2. Multi-round iterative training of the value function may suffer from overfitting or cumulative bias, and the paper offers no quantitative analysis of this.
3. While token efficiency is theoretically improved, runtime latency from multiple value evaluations and beam search is not quantified. For real-time systems, this trade-off is e

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The exposition of the idea is generally clear.
- Steering a large model without modifying its weights is an important problem.
- Theoretical claims seem valid.

Weaknesses

- The primary concern is the computational overhead of the proposed method.
- IRO may require significant training compute and, furthermore, non-trivial inference-time memory and latency overhead. Please correct me if I am wrong.
- For example, if we use the 7B value model and apply IRO for 3 iterations, does this mean that we need to run the 7B model three times during inference, and that the number of parameters being trained is 21B?
- Discussion on the connection to Blockwise Best-of-

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. High Significance and Novelty: The problem of aligning frozen or "black-box" models is extremely relevant. The core idea of using an iterative RL process to learn a sequence of value functions, effectively performing policy iteration without requiring weight updates, is a novel and powerful concept that significantly extends beyond existing one-shot inference methods.
2. Strong Theoretical Foundation: The method is not just a heuristic. It is well-grounded in RL theory, with clear connections d

Weaknesses

1. Training Cost Analysis: The paper heavily focuses on the inference-time efficiency gains over BoN. However, the IRO framework requires an iterative training process: $T$ iterations of generating a full dataset and training a (lightweight) value function. This "offline" training cost is not trivial and is not thoroughly compared against the one-time training cost of methods like DPO or the pure (but massive) inference cost of BoN with a very large $N$. A "total compute" comparison would make t
