From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng, Liqiang Nie

TL;DR
BehaviorVLA introduces a novel framework for learning temporally coherent behavioral representations in vision-language-action models, improving robustness and generalization across diverse environments and tasks.
Contribution
The paper proposes BehaviorVLA, which pairs a causal Mamba-based encoder with a phase-conditioned decoder for better long-horizon behavior modeling.
Findings
Achieves state-of-the-art success rates on the RoboTwin 2.0, LIBERO, and CALVIN benchmarks.
Matches OpenVLA-OFT performance in sim-to-real transfer with only 50% of the demonstration data.
Demonstrates improved data efficiency and generalization in complex scenarios.
Abstract
Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent variables, they are often limited by short-horizon temporal fragmentation and static execution-alignment, leading to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation by learning temporally coherent behavioral representations. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which uses a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the…
