VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Haoran Yuan; Weigang Yi; Zhenyu Zhang; Wendi Chen; Yuchen Mo; Jiashi Yin; Xinzhuo Li; Xiangyu Zeng; Chuan Wen; Cewu Lu; Katherine Driggs-Campbell; Ismini Lourentzou

arXiv:2603.23481·cs.RO·March 25, 2026

Open Access

TL;DR

VTAM introduces a multimodal framework combining video and tactile data to improve physical interaction modeling, especially in contact-rich scenarios, surpassing visual-only models in stability and precision.

Contribution

The paper presents VTAM, a novel multimodal world model integrating tactile perception with video transformers through efficient finetuning and regularization, enhancing contact-rich manipulation capabilities.

Findings

01 Achieves 90% success rate in contact-rich tasks
02 Outperforms baseline by 80% in potato chip pick-and-place
03 Demonstrates the importance of tactile feedback in physical modeling

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling…

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Sensor and Energy Harvesting Materials · Robot Manipulation and Learning