OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing

Zhengxue Cheng; Yiqian Zhang; Wenkang Zhang; Haoyu Li; Keyu Wang; Li Song; Hengdi Zhang

arXiv:2508.08706·cs.RO·August 25, 2025

TL;DR

OmniVTLA is a new multi-modal model integrating vision, tactile, and language data to improve robot manipulation, especially contact-rich tasks, by using a dual-path tactile encoder and a comprehensive tactile dataset.

Contribution

The paper introduces OmniVTLA, which features a dual-path tactile encoder; presents ObjTac, a new tactile dataset; and demonstrates improved real-world manipulation performance.

Findings

01

Achieves 96.9% success rate in pick-and-place tasks with grippers.

02

Outperforms state-of-the-art VLA models by 21.9% in success rate.

03

Reduces task completion time and produces smoother trajectories.

Abstract

Recent vision-language-action (VLA) models build upon vision-language foundations and have achieved promising results, showing potential for task generalization in robot manipulation. However, due to the heterogeneity of tactile sensors and the difficulty of acquiring tactile data, current VLA models largely overlook tactile perception and fail in contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a novel architecture that incorporates tactile sensing. Our contributions are threefold. First, OmniVTLA features a dual-path tactile encoder framework, which enhances tactile perception across diverse vision-based and force-based tactile sensors by using a pretrained vision transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we introduce ObjTac, a comprehensive force-based tactile dataset…
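
The dual-path idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders here are hypothetical mean-pooling stand-ins for the pretrained ViT and the SA-ViT, and combining the two paths by concatenating their token sequences is an assumption, since the abstract does not say how the outputs are merged. All names (`mean_pool_encoder`, `dual_path_tactile_tokens`) are made up for illustration.

```python
from typing import Callable, List, Sequence

Token = List[float]
Encoder = Callable[[Sequence[Sequence[float]]], List[Token]]


def mean_pool_encoder(scale: float) -> Encoder:
    """Hypothetical stand-in for a ViT: emits one 1-D token per tactile patch."""
    def encode(patches: Sequence[Sequence[float]]) -> List[Token]:
        return [[scale * sum(p) / len(p)] for p in patches]
    return encode


def dual_path_tactile_tokens(
    patches: Sequence[Sequence[float]],
    pretrained_path: Encoder,
    aligned_path: Encoder,
) -> List[Token]:
    # Feed the same tactile patches through both paths, then concatenate
    # the two token sequences (concatenation is an assumption, see above).
    return pretrained_path(patches) + aligned_path(patches)


# Two toy tactile patches; the two "encoders" differ only by a scale factor.
patches = [[0.0, 1.0], [2.0, 4.0]]
tokens = dual_path_tactile_tokens(
    patches, mean_pool_encoder(1.0), mean_pool_encoder(10.0)
)
```

In a real model each path would be a transformer producing high-dimensional embeddings, and the resulting tactile tokens would be passed to the VLA backbone alongside vision and language tokens.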
