Flow Actor-Critic for Offline Reinforcement Learning

Jongseong Chae; Jongeui Park; Yongjae Shin; Gyeongmin Kim; Seungyul Han; Youngchul Sung

arXiv:2602.18015·cs.LG·February 23, 2026

TL;DR

Flow Actor-Critic uses a flow model both as the policy and as the basis for a conservative critic regularizer in offline reinforcement learning, capturing complex, multi-modal data distributions and achieving state-of-the-art results on benchmark datasets.

Contribution

The paper presents a flow-based actor-critic method for offline RL that uses an expressive flow model both as the policy and to regularize a conservative critic, improving performance on complex datasets.

Findings

01

Achieves state-of-the-art results on D4RL and OGBench benchmarks.

02

Effectively models complex and multi-modal data distributions.

03

Prevents Q-value explosion in out-of-data regions.

Abstract

Dataset distributions in offline reinforcement learning (RL) are often complex and multi-modal, necessitating policies more expressive than the widely used Gaussian policies. To handle such datasets, we propose Flow Actor-Critic, a new actor-critic method for offline RL based on recent flow policies. The proposed method not only uses a flow model for the actor, as in previous flow policies, but also exploits the expressive flow model to obtain a conservative critic that prevents Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on the flow behavior proxy model obtained as a byproduct of the flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance on offline RL test datasets including the D4RL and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The preliminary experiments strongly show that flow matching can better estimate the density of the behavior policy.
- The proposed method is intuitive and seems to be a nice approach to incorporating flow matching into offline RL.
- The proposed method shows strong performance not only on the D4RL benchmarks but also on the more recent OGBench benchmarks.
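To make the density-estimation point concrete: one standard way to read a density off a continuous flow is the instantaneous change-of-variables formula, which integrates the divergence of the velocity field along the sample path. The sketch below is a toy illustration only. It assumes a hand-written 1D linear velocity field rather than a learned model, a standard normal base distribution, and plain Euler integration; the names `flow_log_density`, `toy_velocity`, and `toy_divergence` are hypothetical, and this excerpt does not state how the paper itself computes the behavior density.

```python
import math

A = 0.5  # slope of the toy velocity field (an assumption for illustration)

def standard_normal_logpdf(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def toy_velocity(x, t):
    # Stand-in for a learned velocity field v(x, t).
    return A * x

def toy_divergence(x, t):
    # d/dx of the toy velocity; a learned model would need autodiff or an estimator.
    return A

def flow_log_density(x1, velocity, divergence, steps=1000):
    """Log-density at x1 under the flow that transports N(0, 1) from t=0 to t=1.

    Integrates the ODE backward in time with Euler steps while accumulating
    the divergence of the velocity field along the path.
    """
    dt = 1.0 / steps
    x, div_integral = x1, 0.0
    for k in range(steps):
        t = 1.0 - k * dt
        div_integral += divergence(x, t) * dt
        x -= velocity(x, t) * dt
    return standard_normal_logpdf(x) - div_integral

def closed_form_log_density(x1):
    # For v = A * x the flow map is x1 = x0 * exp(A), so the density is known exactly.
    return standard_normal_logpdf(x1 * math.exp(-A)) - A
```

For this linear field the numerical estimate can be checked against `closed_form_log_density`; with a learned flow no such closed form exists, which is why the accuracy concern raised in the weaknesses matters.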

Weaknesses

- Since the proposed method relies heavily on the accuracy of the density estimation from flow matching, there is some possibility that the proposed method may not scale well to higher-dimensional environments like pixel-based ones.
- Importantly, the proposed method introduces many new hyperparameters (alpha, lambda, clipped double Q-learning, and epsilon), and those hyperparameters are tuned for each task. This is not a fair comparison with the baselines. For example, CQL and IQL use the same

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Originality: The paper introduces a novel and coherent idea: using the same flow model both for actor regularization and critic penalization. This dual use of flow behavior density provides a principled way to directly detect OOD regions, addressing a key challenge in offline RL.

Technical quality: The method is conceptually sound and the derivation of the flow-based critic operator is clear. The integration of density-weighted Q penalization and flow-regularized actor optimization is well mot

Weaknesses

Incremental over FQL: Conceptually, the work extends FQL by reusing the flow density for critic penalization rather than introducing an entirely new framework. Although the empirical gains are notable, the conceptual advance may be viewed as incremental.

Computational cost: Flow-based models are typically heavier than Gaussian or VAE-based policies, but the paper does not discuss computational trade-offs.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper is clearly written and well structured. The motivation is crisp, the method is introduced in a logical sequence, and the experiments are laid out so the key results are easy to grasp before diving into details. The core idea is clean and elegant. By using the behavior model’s density to decide when to be conservative, the critic only gets pushed down where actions look out-of-distribution and is left alone where the data provide strong support. Coupling this with a one-step flow acto
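The density-gated conservatism described in this review can be sketched in a few lines. This is a minimal illustration, assuming a hard indicator gate on the behavior log-density and a penalty that is simply the gated mean of Q; the paper's actual regularizer is not given in this excerpt, and `gate_weights`, `gated_critic_penalty`, `log_threshold`, and the default `alpha` are hypothetical names and values.

```python
def gate_weights(log_densities, log_threshold):
    """1.0 where the behavior log-density is below the threshold (looks OOD), else 0.0."""
    return [1.0 if lp < log_threshold else 0.0 for lp in log_densities]

def gated_critic_penalty(q_values, log_densities, log_threshold, alpha=1.0):
    """Mean of alpha * w * Q over the batch.

    Added to a critic loss, this term pushes Q down only for actions whose
    gate weight is 1 and leaves well-supported actions untouched.
    """
    weights = gate_weights(log_densities, log_threshold)
    gated_sum = sum(w * q for w, q in zip(weights, q_values))
    return alpha * gated_sum / len(q_values)

# Toy batch: only the first action has low behavior density.
q_batch = [10.0, 2.0, 5.0]
logp_batch = [-8.0, -1.0, -0.5]
penalty = gated_critic_penalty(q_batch, logp_batch, log_threshold=-3.0)
```

In this toy batch only the first action contributes to the penalty, which mirrors the reviewer's description of a critic that is left alone where the data provide support.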

Weaknesses

I don’t see any major weaknesses. The only area that feels under-explored is the threshold design used to decide when the behavior density is “low.” The paper offers two options—a dataset-wide constant and a batch-adaptive threshold—but stops short of examining how this choice affects robustness across tasks. It would strengthen the work to add a small, focused study on the threshold itself: for example, trying simple variants like scaling the base threshold by a weight; using per-batch quantile
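The threshold variants this reviewer mentions are easy to state in code. The sketch below is illustrative only: it assumes thresholds are compared against behavior log-densities, uses a simple nearest-rank quantile, and introduces the hypothetical helpers `scaled_log_threshold` and `batch_quantile_log_threshold`. It does not reproduce the paper's own constant or batch-adaptive definitions, which are not given in this excerpt.

```python
import math

def scaled_log_threshold(base_log_threshold, weight):
    """Scale a density threshold by `weight`, expressed in log space."""
    if weight <= 0:
        raise ValueError("weight must be positive")
    return base_log_threshold + math.log(weight)

def batch_quantile_log_threshold(batch_log_densities, q):
    """Nearest-rank style q-quantile of the log-densities in the current batch."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(batch_log_densities)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

# Toy batch of behavior log-densities (made-up numbers for illustration).
batch = [-6.0, -1.0, -0.5, -2.0, -9.0]
fixed = -3.0                                          # a dataset-wide constant
halved = scaled_log_threshold(fixed, 0.5)             # stricter: fewer actions count as OOD
adaptive = batch_quantile_log_threshold(batch, 0.5)   # moves with each batch
```

A quantile-based threshold flags roughly the same fraction of each batch regardless of the density scale, whereas a fixed or scaled constant depends on that scale; comparing the two across tasks is the kind of focused study the review asks for.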

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning