GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

He Zhang; Ying Sun; Hui Xiong

arXiv:2603.14245·cs.LG·March 17, 2026

Open Access 3 Reviews

TL;DR

GoldenStart (GSFlow) introduces Q-guided priors and entropy control to improve flow-matching policy distillation, enabling efficient, exploratory, and high-performance reinforcement learning policies with reduced inference latency.

Contribution

The paper proposes GSFlow, a novel policy distillation method using Q-guided priors and explicit entropy regulation to enhance exploration and inference efficiency in flow-based RL policies.

Findings

01 · Outperforms prior state-of-the-art methods on continuous control benchmarks.

02 · Effectively balances exploration and exploitation through entropy regularization.

03 · Achieves faster inference with high-quality policy representations.

Abstract

Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor that presents significant untapped potential. This overlooked factor, along with the challenge of controlling policy stochasticity, constitutes two critical areas for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper precisely identifies two previously overlooked weaknesses in existing one-step distilled flow policies — the use of an uninformed Gaussian prior and the absence of controllable stochasticity — and builds upon them to formulate a coherent research question and motivation. The work introduces a conditional VAE that learns a state-conditioned high-Q prior, transforming the random starting point in generative inference into a structured, value-aligned initialization, thereby providing an

Weaknesses

1. The Advantage Noise Selection module relies on a hard argmax over Q-values to identify the optimal noise per state. While simple, this approach may amplify critic bias and reduce diversity in the learned prior. More robust alternatives such as soft advantage weighting or top-k filtering could potentially mitigate this brittleness.
2. While GS-flow demonstrates impressive efficiency within the family of flow-matching methods (notably compared to FQL and IFQL), it remains unclear whether this
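As an illustration of the first weakness above, the following minimal sketch contrasts the three selection rules the reviewer names: hard argmax, soft weighting, and top-k filtering over candidate noises. It is not the paper's implementation. `toy_q`, `toy_policy`, the candidate count, the temperature, and `k` are hypothetical stand-ins chosen only to make the example runnable.

```python
import math
import random


def toy_policy(state, noise):
    # Hypothetical one-step policy: maps (state, noise) to an action.
    return state + noise


def toy_q(state, action):
    # Hypothetical critic: highest value when the action equals the state.
    return -(action - state) ** 2


def score_noises(state, noises):
    # Q-value of the action produced by each candidate noise.
    return [toy_q(state, toy_policy(state, z)) for z in noises]


def hard_argmax(noises, q_values):
    # Keep only the single highest-Q noise (first one on ties).
    best = max(range(len(noises)), key=lambda i: q_values[i])
    return noises[best]


def softmax_weights(q_values, temperature=1.0):
    # Soft alternative: every candidate keeps a weight proportional to exp(Q / T).
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(noises, q_values, k):
    # Keep the k highest-Q noises, best first.
    order = sorted(range(len(noises)), key=lambda i: q_values[i], reverse=True)
    return [noises[i] for i in order[:k]]


rng = random.Random(0)
state = 0.5
noises = [rng.gauss(0.0, 1.0) for _ in range(8)]
q_values = score_noises(state, noises)

z_star = hard_argmax(noises, q_values)           # one target, sensitive to critic error
weights = softmax_weights(q_values, 0.5)         # all candidates, weighted
kept = top_k(noises, q_values, k=3)              # a small diverse set of targets
```

The hard argmax returns a single target, so any error in the critic's ranking is passed on in full. The soft and top-k variants retain several candidates, which is the diversity argument the reviewer makes.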

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The clarity of the paper, the explanations, the contributions and evidence supporting them
- The introduction of the crescent toy task with the 2D visualisations, as well as the ablation studies in the experiments for crescent for each of the contributions (prior / entropy), which helps in understanding
- The simplicity of the introduced changes
- The exhaustive set of experiments (as in FQL)

Weaknesses

- The additional computational cost for learning the VAE prior distribution

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The idea of Golden priors is an interesting one, and there are not many instantiations of it. As this paper points out, directly optimizing flow-matching policies using RL has many challenges that this method avoids. Advantage Noise Selection is a reasonable method for creating higher-valued noise targets. The experimental results are thorough, and the crescent experiment is a very good demonstration of the method doing what it claims. The method achieves quite good performance on the extensive

Weaknesses

I would describe the instantiation of Advantage Noise Selection as amortizing the rejection-sampling process. I think a stronger argument could have been made for, and more time spent on, why we expect an advantage of this method over rejection sampling (IFQL is the included reference for this) — why does this lead to better performance when it is doing something pretty similar? The 2x speedup I think is not strength enough alone given the added complexity. The entropy regularization is empiric
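The reviewer's "amortized rejection sampling" reading can be sketched as follows, under loudly stated assumptions: `toy_policy` and `toy_q` are hypothetical, and a single Gaussian fitted to the selected noises stands in for the paper's conditional VAE. The point is only the cost structure. Best-of-N inference calls the critic N times per action, whereas a learned prior pays that cost during training and draws one sample at inference.

```python
import random
import statistics


def toy_policy(state, noise):
    # Hypothetical one-step policy: maps (state, noise) to an action.
    return state + noise


def toy_q(state, action):
    # Hypothetical critic: prefers action = state + 1, i.e. noise near 1.
    return -(action - (state + 1.0)) ** 2


def best_of_n(state, n, rng):
    # Rejection-sampling style inference: n critic calls for every action.
    noises = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return max(noises, key=lambda z: toy_q(state, toy_policy(state, z)))


def fit_amortized_prior(states, n, rng):
    # Stand-in for a learned prior: fit a Gaussian to the selected noises.
    # In this toy the best noise does not depend on the state, so an
    # unconditional fit is enough; a real prior would condition on the state.
    selected = [best_of_n(s, n, rng) for s in states]
    return statistics.mean(selected), statistics.stdev(selected)


rng = random.Random(0)
train_states = [rng.uniform(-1.0, 1.0) for _ in range(200)]
mu, sigma = fit_amortized_prior(train_states, n=16, rng=rng)

# Inference with the fitted prior: one noise draw, zero critic calls.
z = rng.gauss(mu, sigma)
action = toy_policy(0.3, z)
```

In this toy the fitted prior concentrates near the noise value the critic prefers, so a single draw lands close to what best-of-16 would select. Whether that translates into better returns than rejection sampling is exactly the question the reviewer raises, and this sketch does not answer it.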

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis