GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies
He Zhang, Ying Sun, Hui Xiong

TL;DR
GoldenStart (GSFlow) introduces Q-guided priors and entropy control to improve flow-matching policy distillation, enabling efficient, exploratory, and high-performance reinforcement learning policies with reduced inference latency.
Contribution
The paper proposes GSFlow, a novel policy distillation method using Q-guided priors and explicit entropy regulation to enhance exploration and inference efficiency in flow-based RL policies.
Findings
Outperforms prior state-of-the-art methods on continuous control benchmarks.
Effectively balances exploration and exploitation through entropy regularization.
Achieves faster inference with high-quality policy representations.
Abstract
Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor that presents significant untapped potential. This overlooked factor, along with the challenge of controlling policy stochasticity, constitutes two critical areas for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This…
Peer Reviews
Decision: ICLR 2026 Poster
The paper precisely identifies two previously overlooked weaknesses in existing one-step distilled flow policies — the use of an uninformed Gaussian prior and the absence of controllable stochasticity — and builds upon them to formulate a coherent research question and motivation. The work introduces a conditional VAE that learns a state-conditioned high-Q prior, transforming the random starting point in generative inference into a structured, value-aligned initialization, thereby providing an
1. The Advantage Noise Selection module relies on a hard argmax over Q-values to identify the optimal noise per state. While simple, this approach may amplify critic bias and reduce diversity in the learned prior. More robust alternatives such as soft advantage weighting or top-k filtering could potentially mitigate this brittleness.
2. While GSFlow demonstrates impressive efficiency within the family of flow-matching methods (notably compared to FQL and IFQL), it remains unclear whether this
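The contrast drawn in the first review point, between a hard argmax over Q-values and softer alternatives such as advantage weighting or top-k filtering, can be illustrated with a small sketch. This is not the paper's implementation: `toy_q`, `toy_policy`, and the helper names are hypothetical stand-ins for a learned critic and a one-step distilled policy, and the temperature value is an arbitrary assumption.

```python
import math
import random


def toy_q(state, action):
    # Hypothetical critic (assumption): prefers actions close to the state value.
    return -(action - state) ** 2


def toy_policy(state, noise):
    # Hypothetical one-step policy (assumption): maps a noise sample to an action.
    return state + 0.5 * noise


def score_candidates(state, noises):
    # Q-value of the action produced from each candidate noise sample.
    return [toy_q(state, toy_policy(state, z)) for z in noises]


def hard_argmax(noises, q_values):
    # Keep only the single highest-Q noise; sensitive to critic errors.
    best = max(range(len(noises)), key=lambda i: q_values[i])
    return noises[best]


def soft_weights(q_values, temperature=1.0):
    # Softmax over Q-values. Subtracting the max is for numerical stability;
    # softmax is shift-invariant, so this matches weighting by advantages
    # (Q minus any per-state baseline).
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(noises, q_values, k):
    # Keep the k highest-Q noises, ordered from best to worst.
    order = sorted(range(len(noises)), key=lambda i: q_values[i], reverse=True)
    return [noises[i] for i in order[:k]]


if __name__ == "__main__":
    random.seed(0)
    state = 0.3
    candidates = [random.gauss(0.0, 1.0) for _ in range(8)]
    q = score_candidates(state, candidates)
    print("hard argmax target:", hard_argmax(candidates, q))
    print("top-3 targets:", top_k(candidates, q, 3))
    print("soft weights:", soft_weights(q, temperature=0.1))
```

Under these assumptions, the hard argmax yields a single prior target per state, while the soft weights and top-k set retain several candidates, which is the diversity the reviewer is concerned about. Lower temperatures push the soft weights toward the argmax behaviour.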
- The clarity of the paper, the explanations, the contributions and the evidence supporting them
- The introduction of the crescent toy task with the 2D visualisations, as well as the ablation studies on crescent for each of the contributions (prior / entropy), which help in understanding the method
- The simplicity of the introduced changes
- The exhaustive set of experiments (as in FQL)
- The additional computational cost for learning the VAE prior distribution
The idea of Golden priors is an interesting one, and there are not many instantiations of it. As this paper points out, directly optimizing flow-matching policies using RL has many challenges that this method avoids. Advantage Noise Selection is a reasonable method for creating higher-valued noise targets. The experimental results are thorough, and the crescent experiment is a very good demonstration of the method doing what it claims. The method achieves quite good performance on the extensive
I would describe the instantiation of Advantage Noise Selection as amortizing the rejection-sampling process. I think a stronger argument could have been made for, and more time spent on, why we expect an advantage of this method over rejection sampling (IFQL is the included reference for this): why does this lead to better performance when it is doing something pretty similar? I think the 2x speedup alone is not strong enough given the added complexity. The entropy regularization is empiric
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
