TL;DR
VitaTouch is a multimodal model that combines vision, tactile, and language data to improve material-property inference and defect detection in robotic manufacturing. It achieves the best performance on HCT and the overall TVL benchmark while remaining competitive on SSVTP.
Contribution
The paper introduces VitaTouch, a property-aware vision-tactile-language model, together with VitaSet, a new multimodal dataset, and reports state-of-the-art results in material-property inference and defect recognition.
Findings
VitaTouch achieves 88.89% hardness accuracy on VitaSet.
It reaches 75.13% roughness accuracy and 54.81% descriptor recall.
It attains 100% defect recognition accuracy with fine-tuning.
Abstract
Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it…
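The abstract states that vision and touch are coupled through contrastive learning but does not give the loss. A minimal sketch of one common choice, a symmetric InfoNCE objective over paired embeddings, is shown below. The function names, the temperature value, the embedding size, and the use of InfoNCE itself are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(vision_emb, touch_emb, temperature=0.07):
    # Hypothetical helper: row i of each array is assumed to describe the same object.
    v = l2_normalize(vision_emb)
    t = l2_normalize(touch_emb)
    logits = v @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(v))                   # matching pairs sit on the diagonal
    loss_v2t = -log_softmax(logits)[idx, idx].mean()    # vision -> touch direction
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()  # touch -> vision direction
    return 0.5 * (loss_v2t + loss_t2v)

# Synthetic stand-ins for pooled visual and tactile features (not real data).
rng = np.random.default_rng(0)
touch = rng.normal(size=(8, 32))
aligned_vision = touch + 0.01 * rng.normal(size=(8, 32))
random_vision = rng.normal(size=(8, 32))

aligned_loss = symmetric_info_nce(aligned_vision, touch)
random_loss = symmetric_info_nce(random_vision, touch)
print(aligned_loss < random_loss)
```

In this sketch, paired embeddings that already agree yield a lower loss than unrelated ones, which is the pressure a contrastive objective applies during training.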
