VITA: Vision-to-Action Flow Matching Policy
Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani

TL;DR
VITA introduces a noise-free, conditioning-free flow matching policy that directly maps visual inputs to actions, significantly reducing inference time while maintaining or improving performance across various tasks.
Contribution
The paper presents VITA, a framework that removes the visual conditioning modules and the Gaussian noise source used by conventional flow matching policies: the flow starts from the visual representation and ends at a latent action, using an action autoencoder and flow latent decoding.
Findings
VITA achieves 1.5x-2x faster inference than conventional flow matching and diffusion-based policies.
VITA outperforms or matches state-of-the-art policies on multiple tasks.
The approach effectively bridges vision and action in a unified, efficient framework.
Abstract
Conventional flow matching and diffusion-based policies sample via iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA, VIsion-To-Action policy, a noise-free and conditioning-free flow matching policy learning framework that directly flows from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an…
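The idea of flowing from a visual latent to a latent action can be illustrated with a minimal sketch. It assumes a straight-line (linear interpolation) probability path and a fixed-step Euler solver, which are common choices in flow matching but are not confirmed by the abstract above. All names (`interpolate`, `flow_matching_loss`, `euler_sample`, `oracle_velocity`) are hypothetical, the vectors are toy values, and the oracle stands in for a trained network. The paper's actual encoder, action autoencoder, and training losses are not reproduced here.

```python
def interpolate(z0, z1, t):
    """Point at time t on the straight line from source z0 to target z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(z0, z1)]


def flow_matching_loss(velocity_fn, z0, z1, t):
    """Mean squared error between predicted and target velocity at time t."""
    zt = interpolate(z0, z1, t)
    pred = velocity_fn(zt, t)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, z0, num_steps=4):
    """Integrate dz/dt = v(z, t) from t=0 to t=1, starting at z0.

    Note that velocity_fn receives no separate conditioning input:
    the visual information enters only through the initial state z0.
    """
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(z, k * dt)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy stand-ins (assumptions): a 3-dim "visual latent" as the flow source and
# a 3-dim "latent action" as the flow target, with matching dimensionality.
z0 = [0.5, -1.0, 2.0]
z1 = [1.0, 0.0, -1.0]


def oracle_velocity(z, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return target_velocity(z0, z1)


loss = flow_matching_loss(oracle_velocity, z0, z1, t=0.3)
decoded = euler_sample(oracle_velocity, z0, num_steps=4)
```

In a noise-based policy, `z0` would instead be drawn from a Gaussian and the velocity function would take the visual features as an extra argument at every solver step. In this sketch, that per-step conditioning input is what disappears.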
Peer Reviews
Decision: ICLR 2026 Poster
- I like the core idea of using the visual latent space as the source distribution for the flow matching process. Removing the explicit conditioning mechanisms (like cross-attention or FiLM) that are standard in diffusion/flow policies simplifies the architecture and naturally leads to faster inference.
- The resulting architecture is lightweight. It is compelling that an MLP-only network for the flow matching and decoding components (following the ResNet encoder) can handle complex bimanual ta
- I find the claim of being "conditioning-free" (L014) slightly misleading. While explicit conditioning modules are removed, the flow is inherently conditioned on the visual input because the visual latent is the source distribution (z0). The velocity field must learn the transport from this specific starting point. This feels more like implicit conditioning via the ODE initial state rather than a fundamental removal of conditioning.
- The approach seems heavily constrained by the architecture.
The idea of simplifying visuomotor policy learning by removing conditioning mechanisms is potentially interesting. The paper is generally clear in presentation, and experiments are competently executed. The implementation details are sufficiently documented, and the inclusion of ablation studies is appreciated.
The main conceptual motivation is underdeveloped. The authors state that removing conditioning simplifies the process, yet the argument remains superficial. It is not clear why direct flow from vision to actions should be advantageous, or what specific drawbacks the previous conditioning-based methods introduce. Without a stronger analysis, the contribution feels somewhat incremental. The use of MLP backbones further complicates interpretation of results. Introducing such a lightweight architec
- The paper proposes a novel solution to a timely and important problem, one that seems significant and useful to researchers and practitioners.
- The paper itself is cleanly written and explains the motivation behind the method's ingredients. All claims are supported by evidence, and additional ablations are provided to further support the proposed contributions. VITA performs on par with the baselines while being much faster at inference, which is essential for real-world applications. Pap
I think the paper has two weaknesses. Firstly, it seems to me that the motivation is not sufficiently explained. After all, why is it important to flow directly from images into actions, rather than from noise with visual conditioning? Why does it introduce complexity? Where does the additional overhead come from? Given that this is not analysed further in the paper, I believe that a simple citation (e.g. line 55) of prior work is insufficient. I advise the authors to elaborate on their reaso
Taxonomy
Topics: Simulation Techniques and Applications
