Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System
Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He, Maoqing Yao, Guanghui Ren

TL;DR
Libra-VLA introduces a hierarchical, dual-system approach for vision-language-action models in robotics, improving learning efficiency and execution by decoupling coarse planning from fine control.
Contribution
This work proposes a novel coarse-to-fine dual-system architecture with asynchronous execution, explicitly modeling hierarchical actions to enhance robotic manipulation.
Findings
Performance peaks when action decomposition balances learning difficulty.
Asynchronous execution improves scalability and robustness.
Hierarchical modeling reduces the semantic-actuation gap.
Abstract
Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation: they ground high-level semantic instructions in executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, directly mapping visual-linguistic features to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, in which complex actions can be naturally modeled in a Hybrid Action Space that decomposes them into discrete macro-directional reaching and continuous micro-pose alignment. Ignoring this hierarchy severely widens the semantic-actuation gap and imposes a heavy representational burden on grounding high-level semantics in continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into…
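The abstract excerpt is cut off before it describes how the decomposition is implemented, so the following is only an illustrative sketch of the general idea of a hybrid action: a discrete coarse direction plus a continuous fine residual. The six-direction vocabulary, the `COARSE_STEP` value, and the `decompose`/`recompose` helpers are all hypothetical and are not taken from the paper.

```python
# Illustrative only: NOT the Libra-VLA implementation.
# Hypothetical coarse vocabulary: six axis-aligned unit directions.
DIRECTIONS = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]
COARSE_STEP = 0.05  # assumed length of one macro step, in metres


def decompose(delta):
    """Split a 3D translation into (discrete direction index, continuous residual)."""
    # Score each candidate direction by its dot product with the target motion.
    scores = [sum(d * x for d, x in zip(direction, delta)) for direction in DIRECTIONS]
    idx = max(range(len(DIRECTIONS)), key=scores.__getitem__)
    # The residual is whatever the coarse step leaves uncovered.
    residual = tuple(x - COARSE_STEP * d for x, d in zip(delta, DIRECTIONS[idx]))
    return idx, residual


def recompose(idx, residual):
    """Rebuild the full translation from the coarse index and the fine residual."""
    return tuple(COARSE_STEP * d + r for d, r in zip(DIRECTIONS[idx], residual))


idx, residual = decompose((0.06, 0.01, -0.02))
full = recompose(idx, residual)
```

In a sketch like this, the coarse index is a classification target while the residual is a small regression target, which is one way to read the abstract's split between "macro-directional reaching" and "micro-pose alignment".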
