OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing
Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, Hengdi Zhang

TL;DR
OmniVTLA is a multimodal model that integrates vision, tactile, and language inputs to improve robot manipulation, especially in contact-rich tasks, using a dual-path tactile encoder and a force-based tactile dataset (ObjTac).
Contribution
The paper introduces OmniVTLA, a model with a dual-path tactile encoder, and ObjTac, a new tactile dataset, and demonstrates improved real-world manipulation performance.
Findings
Achieves 96.9% success rate in pick-and-place tasks with grippers.
Outperforms state-of-the-art VLA models by 21.9% in success rate.
Reduces task completion time and produces smoother trajectories.
Abstract
Recent vision-language-action (VLA) models build upon vision-language foundations and have achieved promising results, showing potential for task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Specifically, the contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework. This framework enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset…
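As a rough illustration of the dual-path idea described in the abstract, the sketch below passes one tactile frame through two separate encoders and stacks the resulting tokens. This is only a sketch under stated assumptions, not the authors' implementation: the class LinearPatchEncoder is a hypothetical stand-in for a ViT (a single linear projection per patch), the helper names are invented for this example, and combining the two paths by concatenating their token sequences is an assumption, since the summary does not say how the paths are fused.

```python
import numpy as np

rng = np.random.default_rng(0)


def patchify(image, patch):
    """Split an (H, W, C) array into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


class LinearPatchEncoder:
    """Hypothetical stand-in for a ViT: one linear projection per patch."""

    def __init__(self, patch, channels, dim):
        self.patch = patch
        self.weight = rng.standard_normal((patch * patch * channels, dim)) * 0.02

    def __call__(self, image):
        return patchify(image, self.patch) @ self.weight


def dual_path_tactile_tokens(tactile, pretrained_path, aligned_path):
    """Encode one tactile frame with both paths and stack the tokens.

    Concatenation along the token axis is an assumption of this sketch.
    """
    return np.concatenate(
        [pretrained_path(tactile), aligned_path(tactile)], axis=0
    )


# Toy tactile frame; real sensor resolution and channels will differ.
tactile = rng.standard_normal((32, 32, 3))
pretrained_path = LinearPatchEncoder(patch=8, channels=3, dim=64)  # "pretrained ViT" role
aligned_path = LinearPatchEncoder(patch=8, channels=3, dim=64)     # "SA-ViT" role

tokens = dual_path_tactile_tokens(tactile, pretrained_path, aligned_path)
print(tokens.shape)  # (32, 64): 16 patch tokens from each path
```

In a full model, tokens like these would presumably be fed to the policy backbone alongside vision and language tokens; that step is left out here because the summary does not describe it.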
