Astra: Efficient Transformer Architecture and Contrastive Dynamics Learning for Embodied Instruction Following
Yueen Ma, Dafeng Chi, Shiguang Wu, Yuecheng Liu, Yuzheng Zhuang, Irwin King

TL;DR
Astra is a Transformer architecture built around trajectory attention and trained with an additional contrastive dynamics learning objective. It improves embodied instruction following in robot manipulation tasks by processing segmented multimodal sequences more effectively.
Contribution
The paper presents Astra, a Transformer model that combines trajectory attention with learnable action queries, and pairs it with a contrastive dynamics learning objective to improve multimodal sequence processing in embodied AI.
Findings
Astra outperforms previous models on three robot manipulation benchmarks.
Trajectory attention improves processing of segmented multimodal sequences.
Contrastive dynamics learning enhances environment understanding and modality alignment.
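To make the idea of a contrastive objective concrete, here is a minimal sketch of a generic InfoNCE-style loss in NumPy. It is not taken from the paper: the function name `info_nce`, the `temperature` value, and the choice of pairing each predicted embedding with the target embedding in the same row are all assumptions for illustration, and Astra's actual contrastive dynamics objective may be defined differently.

```python
import numpy as np

def info_nce(pred, target, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not the paper's exact objective).

    pred, target: arrays of shape (batch, dim). Row i of `pred` is treated as
    the positive match for row i of `target`; all other rows are negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the temperature.
    logits = pred @ target.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy against the diagonal (the matching pairs).
    return float(-np.mean(np.diag(log_probs)))

# Toy check: aligned pairs should score a lower loss than mismatched pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = info_nce(emb, emb)
mismatched_loss = info_nce(emb, emb[::-1])
```

In a dynamics-learning setting, `pred` could be a model's predicted representation of the next state and `target` the encoding of the state actually observed, but that mapping is an assumption here rather than something the summary specifies.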
Abstract
Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Astra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model's understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Astra demonstrates substantial performance…
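The contrast between token-level causal attention and attention over whole segments can be illustrated with attention masks. The sketch below is an assumption-laden illustration, not the paper's definition of trajectory attention: it assumes a hypothetical `segment_ids` layout and a rule where each token may attend to every token in its own segment and in all earlier segments.

```python
import numpy as np

def causal_mask(n):
    """Standard causal mask: token i may attend to tokens 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def segment_causal_mask(segment_ids):
    """Segment-level mask (illustrative assumption, not Astra's exact rule).

    mask[i, j] is True when token i may attend to token j, i.e. when token j
    belongs to the same segment as token i or to an earlier one.
    `segment_ids` must be non-decreasing along the sequence.
    """
    seg = np.asarray(segment_ids)
    return seg[:, None] >= seg[None, :]

# Hypothetical layout: 2 instruction tokens, 3 image tokens, 1 action token.
segment_ids = [0, 0, 1, 1, 1, 2]
seg_mask = segment_causal_mask(segment_ids)
tok_mask = causal_mask(len(segment_ids))

# Image token at position 2 can see the later image token at position 4
# under the segment-level mask, but not under the token-level causal mask.
same_segment_visible = (bool(seg_mask[2, 4]), bool(tok_mask[2, 4]))
```

Under these assumptions, tokens inside one image or one instruction are no longer forced into an arbitrary left-to-right order, while ordering between segments is still respected.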
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems
