AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization

Longxiang He; Li Shen; Xueqian Wang

arXiv:2405.18187 · cs.LG · November 6, 2025

TL;DR

AlignIQL formulates policy extraction in implicit Q-learning as a constrained optimization problem. It recovers the implicit policy explicitly while retaining IQL's decoupling of actor and critic, and it improves offline RL performance, especially on complex sparse-reward tasks.

Contribution

The paper proposes AlignIQL and AlignIQL-hard algorithms that explicitly solve the implicit policy-finding problem, improving policy extraction and performance over existing methods.

Findings

1. Achieves competitive or superior results on D4RL datasets.
2. Outperforms IQL and IDQL on complex sparse-reward tasks.
3. Maintains the simplicity of IQL while solving the implicit policy-finding problem.

Abstract

Implicit Q-learning (IQL) serves as a strong baseline for offline RL, which learns the value function using only dataset actions through quantile regression. However, it is unclear how to recover the implicit policy from the learned implicit Q-function and why IQL can utilize weighted regression for policy extraction. IDQL reinterprets IQL as an actor-critic method and derives the weights of the implicit policy; however, these weights hold only for the optimal value function. In this work, we introduce a different way to solve the implicit policy-finding problem (IPF) by formulating it as an optimization problem. Based on this optimization problem, we further propose two practical algorithms, AlignIQL and AlignIQL-hard, which inherit the advantages of decoupling actor from critic in IQL and provide insights into why IQL can use weighted regression for policy extraction. Compared with IQL…
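For readers unfamiliar with the two IQL ingredients the abstract refers to, the following is a minimal sketch, not the authors' code. It assumes the standard IQL formulation, in which the value function is fit with an asymmetric squared loss (an expectile loss, which the abstract describes as quantile regression) and the policy is extracted by regression on dataset actions weighted by the exponentiated advantage. The function names, the values `tau=0.7` and `beta=3.0`, and the clipping threshold are illustrative assumptions; AlignIQL's own weights are not reproduced here.

```python
import math


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    `diff` stands for Q(s, a) - V(s) on a dataset transition. With tau > 0.5,
    underestimating Q is penalized more than overestimating it, which pushes
    V toward an upper expectile of Q over dataset actions.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def awr_weights(advantages, beta=3.0, max_weight=100.0):
    """Clipped weights exp(beta * A) for weighted regression (assumed form)."""
    return [min(math.exp(beta * a), max_weight) for a in advantages]


def weighted_bc_loss(log_probs, advantages, beta=3.0):
    """Weighted negative log-likelihood of dataset actions under the policy."""
    weights = awr_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


if __name__ == "__main__":
    # Positive and negative residuals of equal size are penalized differently.
    print(expectile_loss(1.0), expectile_loss(-1.0))
    # Actions with higher advantage receive larger regression weights.
    print(awr_weights([-1.0, 0.0, 1.0]))
```

The policy loss only ever evaluates `log_probs` on actions that appear in the dataset, which is the sense in which actor and critic are decoupled: the critic is trained first, and any policy class that exposes a log-likelihood can be fit afterwards.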

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The proposed method is derived rigorously.
2. The experiment shows that the proposed method has good empirical performance compared with other baselines on standard benchmarks.

Weaknesses

1. The formulation aims to use a general regularization function $f$, which is a good attempt. However, the remaining results seem to rely on the case $f(x) = \log(x)$. Does the result generalize to any other regularization function?
2. Remark 5.7 seems very hand-wavy. How does the algorithm ensure that the action with the positive advantage is chosen? It does not seem to be reflected in the loss function.
3. While the result in Table 1 looks impressive, I am not sure if this can serve a

Reviewer 02 · Rating 5 · Confidence 3

Strengths

- This paper introduces a new approach to tackle the implicit policy-finding problem, combining theoretical rigor with practical effectiveness in offline RL.
- The proposed algorithm, AlignIQL, performs well across varied tasks, demonstrating versatility and effectiveness across different offline RL benchmarks.

Weaknesses

- While AlignIQL is rigorous, it adds complexity to training by requiring additional multiplier networks and diffusion models, which may increase computational costs and sensitivity to hyperparameters. The scalability of the method is also a concern; can it be extended to image-based tasks?
- The authors do not explain the use of diffusion modeling in the methods section.
- The performance of AlignIQL raises some concerns:
  - The authors argue that MuJoCo tasks are already saturated for offli

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- The introduction of AlignIQL as a constrained optimization approach represents a significant advancement in offline reinforcement learning, providing a fresh perspective on implicit policy extraction.
- The empirical results demonstrate that AlignIQL and its variant achieve competitive performance across a variety of D4RL benchmarks, particularly in challenging tasks with sparse rewards, indicating the effectiveness of the proposed methods.
- Theoretical Insights: The paper offers valuable the

Weaknesses

- While the experiments demonstrate competitive performance on specific D4RL benchmarks, the applicability of AlignIQL to other domains or more diverse environments may not be fully established, limiting its generalizability.
- The proposed framework may introduce additional complexity in implementation compared to existing methods, which could deter practitioners who seek simpler solutions for offline reinforcement learning.
- Although the paper includes comparisons with several baseline method

Code & Models

Repositories

felix-thu/AlignIQL · jax · Official

Taxonomy

Topics: Data Stream Mining Techniques · Reinforcement Learning in Robotics · Elevator Systems and Control

Methods: Implicit Q-Learning · Q-Learning