Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
Jimin Lee, Huiwon Jang, Myungkyu Koo, Jungwoo Park, Jinwoo Shin

TL;DR
MoSS is a modular framework that lets vision-language-action models (VLAs) use multiple physical sensory signals, such as tactile and torque, by integrating decoupled modality streams into the action stream via joint cross-modal self-attention.
Contribution
MoSS enables VLAs to incorporate heterogeneous physical signals using decoupled modality streams and a two-stage training scheme for stable, multi-sensory integration.
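As a rough illustration of decoupled modality streams combined through joint self-attention, the NumPy sketch below gives each stream its own query, key, and value projections and then runs a single attention over the concatenated tokens. This is a minimal sketch under stated assumptions, not the paper's implementation: the stream names, token counts, width `d`, single-head layout, and the function `joint_self_attention` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(streams, weights):
    """Single-head joint attention over several token streams.

    streams: dict mapping stream name -> (n_tokens, d) array.
    weights: dict mapping stream name -> {"q", "k", "v"} -> (d, d) array.
    Both layouts are assumptions made for this sketch.
    """
    names = list(streams)
    # Decoupled streams: each modality projects its tokens with its own weights.
    q = np.concatenate([streams[n] @ weights[n]["q"] for n in names])
    k = np.concatenate([streams[n] @ weights[n]["k"] for n in names])
    v = np.concatenate([streams[n] @ weights[n]["v"] for n in names])
    # Joint attention: every token attends over the tokens of all streams.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    # Split the mixed output back into one block per stream.
    sizes = [streams[n].shape[0] for n in names]
    blocks = np.split(out, np.cumsum(sizes)[:-1])
    return dict(zip(names, blocks)), attn


rng = np.random.default_rng(0)
d = 8  # hypothetical token width
streams = {
    "action": rng.normal(size=(4, d)),
    "tactile": rng.normal(size=(3, d)),
    "torque": rng.normal(size=(2, d)),
}
weights = {
    name: {p: rng.normal(size=(d, d)) / np.sqrt(d) for p in ("q", "k", "v")}
    for name in streams
}
outputs, attn = joint_self_attention(streams, weights)
```

Here the action tokens in `outputs["action"]` are weighted mixtures of values from the action, tactile, and torque streams, while the per-stream projections stay separate, so a new modality could in principle be added without touching the existing weights.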
Findings
MoSS improves action prediction accuracy by leveraging multiple physical signals.
In real-world experiments, combining multiple physical signals yields synergistic performance gains.
Incorporating an auxiliary task enhances the modeling of contact interaction dynamics.
Abstract
Humans understand and interact with the real world by relying on diverse physical feedback beyond visual perception. Motivated by this, recent approaches attempt to incorporate physical sensory signals into Vision-Language-Action models (VLAs). However, they typically focus on a single type of physical signal, failing to capture the heterogeneous and complementary nature of real-world interactions. In this paper, we propose MoSS, a modular sensory stream framework that adapts VLAs to leverage multiple sensory signals for action prediction. Specifically, we introduce decoupled modality streams that integrate heterogeneous physical signals into the action stream via joint cross-modal self-attention. To enable stable incorporation of new modalities, we adopt a two-stage training scheme that freezes pretrained VLA parameters in the early stage. Furthermore, to better capture contact…
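The two-stage training scheme mentioned in the abstract can be pictured as a schedule over parameter groups. The sketch below is only an illustration under assumptions: the abstract states that pretrained VLA parameters are frozen in the early stage, but the group names, the helper `trainable_groups`, and the choice to update every group in the second stage are hypothetical.

```python
def trainable_groups(stage, groups):
    """Return the names of parameter groups that receive updates in a stage.

    groups: dict mapping group name -> True if the group is pretrained.
    """
    if stage == 1:
        # Stage 1: pretrained VLA parameters stay frozen; only new modules train.
        return [name for name, pretrained in groups.items() if not pretrained]
    if stage == 2:
        # Stage 2 (assumption): all groups are updated together.
        return list(groups)
    raise ValueError(f"unknown stage: {stage}")


# Hypothetical parameter groups for illustration.
groups = {
    "pretrained_vla": True,
    "tactile_stream": False,
    "torque_stream": False,
}
stage1 = trainable_groups(1, groups)
stage2 = trainable_groups(2, groups)
```

Freezing the pretrained weights first lets newly initialized sensory streams settle before they can disturb the pretrained policy, which matches the stated goal of stable incorporation of new modalities.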
