Universal Visuo-Tactile Video Understanding for Embodied Interaction
Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, Wenbo Ding

TL;DR
This paper introduces VTV-LLM, a multi-modal large language model that integrates visual and tactile video data for enhanced embodied interaction and tactile understanding.
Contribution
The paper presents the first visuo-tactile video understanding model, together with a new dataset (VTV150K) and a three-stage training paradigm for cross-modal tactile reasoning.
Findings
VTV-LLM outperforms existing models in tactile video understanding tasks.
The VTV150K dataset provides annotations for four fundamental tactile attributes across 100 objects and three tactile sensors.
The framework supports complex tactile reasoning and decision making.
Abstract
Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes…
Taxonomy
Topics: Tactile and Sensory Interactions · Advanced Sensor and Energy Harvesting Materials · Multimodal Machine Learning Applications
