Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow

Yunyue Wei; Chenhui Zuo; Yanan Sui

arXiv:2601.19707·cs.LG·January 28, 2026

Open Access · 3 Reviews

TL;DR

Qflex is a scalable, value-guided exploration method for high-dimensional continuous control. By aligning exploration with task-relevant gradients instead of isotropic noise, it improves performance and sample efficiency on complex control tasks.

Contribution

The paper presents Qflex, an exploration approach that operates directly in the native high-dimensional action space using value-guided flows. It targets the main weakness of undirected exploration, which degrades sharply as action dimensionality grows, without resorting to dimensionality reduction.

Findings

01

Qflex outperforms baseline RL methods on high-dimensional benchmarks.

02

Qflex effectively controls a full-body human musculoskeletal model.

03

Qflex demonstrates superior scalability and sample efficiency.

Abstract

Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected with sharp degradation as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex traverses actions from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online…
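The contrast the abstract draws between value-guided flow and isotropic noise can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the critic is a hand-written quadratic `q_value` rather than a learned network, the source distribution is a fixed diagonal Gaussian rather than a learnable one, the flow is approximated by a few Euler steps along the action gradient of Q, and all function names and parameters (`flow_action`, `n_steps`, `step_size`, `compare`) are hypothetical.

```python
import random


def q_value(action, target):
    """Toy critic (assumption): Q(a) = -||a - target||^2."""
    return -sum((a - t) ** 2 for a, t in zip(action, target))


def q_grad(action, target):
    """Analytic gradient of the toy critic with respect to the action."""
    return [-2.0 * (a - t) for a, t in zip(action, target)]


def sample_source(mean, std, rng):
    """Draw one action from a diagonal Gaussian source distribution."""
    return [rng.gauss(m, std) for m in mean]


def flow_action(action, target, n_steps=10, step_size=0.05):
    """Euler-integrate da/dt = grad_a Q(a) for a fixed number of steps."""
    a = list(action)
    for _ in range(n_steps):
        g = q_grad(a, target)
        a = [ai + step_size * gi for ai, gi in zip(a, g)]
    return a


def compare(dim=700, n_samples=32, seed=0):
    """Return mean toy Q of raw source samples and of flowed samples."""
    rng = random.Random(seed)
    target = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    mean = [0.0] * dim
    src_q, flow_q = 0.0, 0.0
    for _ in range(n_samples):
        a0 = sample_source(mean, 0.3, rng)  # undirected Gaussian sample
        src_q += q_value(a0, target)
        flow_q += q_value(flow_action(a0, target), target)  # value-guided sample
    return src_q / n_samples, flow_q / n_samples


if __name__ == "__main__":
    src_q, flow_q = compare()
    print(f"mean Q of source samples: {src_q:.2f}")
    print(f"mean Q of flowed samples: {flow_q:.2f}")
```

In this toy setting every Euler step shrinks the distance to the maximizer of Q, so the flowed samples always score higher than the raw Gaussian samples they started from. With a learned critic that guarantee disappears, which is the concern Reviewer 02 raises about unreliable Q-gradients early in training.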

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is clearly and concisely written.
2. The paper presents thorough comparative experiments, testing QFLEX on six benchmark tasks and comparing it with eight baseline methods.
3. The proposed QFLEX method efficiently addresses exploration in high-dimensional action spaces, overcoming the inefficiencies of undirected exploration.

Weaknesses

1. Although QFLEX performs well in simulated tasks, the paper does not provide sufficient validation on real-world systems.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The idea is well-motivated, clearly established, and novel. The problem it addresses is central to scaling reinforcement learning policies to more complex and realistic control tasks. Leveraging Q-guided flows to generate actions for directed exploration is both natural and insightful, with its benefits demonstrated across several high-dimensional tasks that are typically challenging for RL policies to learn from scratch. Moreover, the method introduces only minimal modifications to a standard a

Weaknesses

The primary weakness of the proposed method lies in its strong reliance on accurate Q-gradient estimation. During early training stages, or in scenarios where Q-gradients are unreliable, the resulting action updates may misguide exploration and ultimately hinder policy improvement. Although the paper discusses the use of batch normalization to stabilize Q-learning, a more thorough analysis of potential failure cases or mitigation strategies would further strengthen the work.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The idea of aligning exploration with the value function through a flow-based transformation is well grounded and connects flow matching with reinforcement learning in a principled way.
- The method achieves state-of-the-art performance on several high-dimensional control benchmarks, including tasks with >700 actuators.
- Qflex integrates easily into standard actor-critic frameworks such as SAC, without architectural modifications.

Weaknesses

- While applying flow matching to online RL for exploration is interesting, closely related ideas appear in [1] and [2]. The main novelty lies in scaling to high-dimensional systems rather than a fundamentally new formulation.
- The paper does not analyze how value-guided flows improve exploration or in which regimes they outperform Gaussian or latent-space baselines.
- The qualitative figure contrasting Gaussian vs. Qflex exploration is illustrative but lacks quantitative evaluation or a contro

Taxonomy

Topics: Reinforcement Learning in Robotics · Muscle activation and electromyography studies · Model Reduction and Neural Networks