ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation
Songyuan Zhang, Oswin So, H. M. Sabbir Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, Chuchu Fan

TL;DR
ReFORM introduces a flow-based offline RL method that maintains support constraints through noise manipulation, effectively reducing out-of-distribution errors and outperforming baselines across diverse tasks.
Contribution
The paper presents ReFORM, a novel flow policy approach that enforces support constraints by design, improving offline RL performance without extensive hyperparameter tuning.
Findings
ReFORM outperforms all baselines on 40 OGBench tasks.
ReFORM maintains support constraints effectively.
ReFORM achieves superior performance with a fixed hyperparameter set.
Abstract
Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset generated by behavior policies without additional environment interactions. One common challenge that arises in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM…
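The abstract is cut off before it describes the mechanism, but the reviews below mention a bounded source distribution and a reflected flow. As a rough illustration of that general idea only, the sketch below integrates a velocity field with Euler steps and mirrors any sample that crosses the boundary back into the domain, so the result stays inside the bounded set however hard the field pushes outward. Everything here is an assumption made for illustration and not the authors' implementation: the box domain [-1, 1]^d, the Euler integrator, the step count, and the hypothetical names `reflect_into_box`, `reflected_euler_flow`, and `toy_velocity` (a stand-in for a learned network).

```python
import random


def reflect_into_box(x, low=-1.0, high=1.0):
    """Fold a scalar back into [low, high] by mirror reflection at the walls."""
    width = high - low
    # Position inside one reflection period of length 2 * width, measured from `low`.
    y = (x - low) % (2.0 * width)
    # The second half of the period is the mirrored branch.
    if y > width:
        y = 2.0 * width - y
    return low + y


def reflected_euler_flow(z0, velocity, num_steps=10):
    """Integrate dz/dt = velocity(z, t) over t in [0, 1], reflecting at the box walls."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(z, t)
        # Plain Euler step, then fold each coordinate back into the box.
        z = [reflect_into_box(zi + dt * vi) for zi, vi in zip(z, v)]
    return z


def toy_velocity(z, t):
    # Hypothetical stand-in for a learned velocity network: it pulls every
    # coordinate toward 3.0, which lies outside the box on purpose.
    return [5.0 * (3.0 - zi) for zi in z]


rng = random.Random(0)
z0 = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # bounded source sample
z1 = reflected_euler_flow(z0, toy_velocity, num_steps=20)
print(all(-1.0 <= zi <= 1.0 for zi in z1))
```

Without the reflection step, the same field would carry every coordinate out of the box within a few steps. The folding keeps the output in the bounded domain by construction instead of relying on a penalty term.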
Peer Reviews
Decision: ICLR 2026 Poster
* The OOD issue and the distribution of the optimal action policy are classic topics in offline RL, and it is appreciated that the authors consider them from a new perspective.
* The use of a bounded source distribution and a reflected flow is quite novel and appealing. It fundamentally prevents OOD actions from being explored.
* The reflected flow noise generator can produce complex multimodal noise, which is helpful in scenarios where the real action distribution is quite complex.
1. Some related references are missing, and the authors should consider discussing the following related work in the manuscript:
   * https://arxiv.org/abs/2202.06239
   * https://arxiv.org/abs/1705.08868
   * https://arxiv.org/abs/2301.12130
2. The model design is appealing; however, the performance of the model also relies on the quality of the behavior cloning (BC) model. How does the model perform if the BC model is not well estimated? Have the authors considered the robustness of the proposed method?
3. Although the
The paper presents strong empirical results and thorough ablations justifying the design choices. In particular, ReFORM achieves strong performance across a variety of environments and tasks while using the same set of hyperparameters, which is uncommon for offline RL algorithms.
The paper does not analyze why ReFORM outperforms the baselines in some environments and datasets (clean vs. noisy) but not in others. The paper does not investigate how this approach may scale to higher-dimensional state-action spaces. Does ReFORM’s approach of constraining to the data-generating policy’s support work in a higher-dimensional space such as image-based inputs? The paper does not compare to state-of-the-art algorithms on OGBench such as SORL [1] and floq [2].
Originality: ReFORM introduces a conceptually fresh idea (noise reflection for on-support control) that differs from prior regularization-based approaches. The theoretical analysis clarifies when KL or Wasserstein constraints fail to guarantee support inclusion, which is insightful.
Technical Quality: The paper provides rigorous derivations and clear theorems connecting reflection dynamics to support preservation. The algorithmic design (bounded latent + reflected flow + distillation) is internal
Limited Empirical Scope: Experiments are confined to OGBench; there is no evaluation on D4RL, Adroit, or visual RL tasks. The claim of “state-of-the-art performance” is weakened by the absence of comparisons with recent strong baselines such as A2PR (arXiv 2405.19909), XQL, and EDQL. A computational cost analysis is lacking; reflection dynamics and distillation likely add overhead.
Moderate Empirical Depth: Ablation results (λ, number of clusters, reflection strength) are only in the appendix; some should app
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
