From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

Bing Hu; Zaijing Li; Rui Shao; Junda Chen; April Hua Liu; Wei-Shi Zheng; Liqiang Nie

arXiv:2605.22671·cs.CV·May 22, 2026

TL;DR

BehaviorVLA introduces a novel framework for learning temporally coherent behavioral representations in vision-language-action models, improving robustness and generalization across diverse environments and tasks.

Contribution

The paper proposes BehaviorVLA, featuring a causal Mamba-based encoder and phase-conditioned decoder for better long-horizon behavior modeling.
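To make the idea of causal long-horizon aggregation concrete, here is a minimal sketch and not the paper's encoder. It assumes a fixed-decay diagonal linear recurrence, whereas Mamba-style models use learned, input-dependent state-space parameters. The names `causal_scan` and `behavior_summary` and the `decay` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def causal_scan(features: Sequence[Sequence[float]], decay: float = 0.9) -> List[List[float]]:
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t over a trajectory.

    Returns the hidden state after every step. The state at step t depends
    only on steps 0..t, which is what makes the scan causal.
    Assumption: a fixed scalar decay stands in for learned parameters.
    """
    if not features:
        return []
    state = [0.0] * len(features[0])
    states = []
    for x in features:
        # Blend the previous state with the current per-step feature vector.
        state = [decay * h + (1.0 - decay) * v for h, v in zip(state, x)]
        states.append(state)
    return states


def behavior_summary(features: Sequence[Sequence[float]], decay: float = 0.9) -> List[float]:
    """Use the final hidden state as a single trajectory-level vector."""
    states = causal_scan(features, decay)
    return states[-1] if states else []


# Toy trajectory of three 2-dimensional per-step features (made-up values).
trajectory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Causality check: states for a prefix match the first states of the full run.
assert causal_scan(trajectory)[:2] == causal_scan(trajectory[:2])

summary = behavior_summary(trajectory, decay=0.5)
```

The final state compresses the whole trajectory into one vector, and a larger `decay` keeps older steps influential for longer, which is the sense in which a recurrence of this kind can carry long-horizon information.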

Findings

01 Achieves state-of-the-art success rates on the RoboTwin 2.0, LIBERO, and CALVIN benchmarks.

02 Matches OpenVLA-OFT performance in sim-to-real transfer with only 50% of the demonstration data.

03 Demonstrates improved data efficiency and generalization in complex scenarios.

Abstract

Vision-Language-Action (VLA) models often suffer performance degradation under distribution shift because they struggle to learn behavior representations that generalize across varying environments. Existing approaches attempt to construct behavior representations through action-centric latent variables, but they are often limited by short-horizon temporal fragmentation and static execution alignment, which leads to inconsistent behaviors in complex scenarios. To address these limitations, we propose BehaviorVLA, a framework that facilitates robust manipulation by learning a temporally coherent behavioral representation. Our approach features two symmetric components: (1) the Visuomotor Behavior Encoder (VBE), which uses a causal Mamba-based architecture to aggregate long-horizon trajectory information into a unified behavior representation; and (2) the…
