Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Ning Cheng; You Li; Jing Gao; Bin Fang; Jinan Xu; Wenjuan Han

arXiv:2403.09813 · cs.CV · June 18, 2024 · 3 cites

Open Access

TL;DR

This paper introduces the TLV dataset for multimodal perception involving touch, language, and vision, and proposes a lightweight framework, STLV-Align, for semantic alignment across these modalities, advancing multimodal understanding in robotics and AI.

Contribution

The paper presents a new multimodal dataset combining touch, language, and vision with sentence-level descriptions, and a lightweight alignment framework for effective multimodal semantic integration.

Findings

01

Effective semantic alignment is achieved while updating only 1% of the parameters.

02

The TLV dataset pairs touch and vision data with sentence-level language descriptions, enabling richer multimodal perception.

03

STLV-Align outperforms existing methods in multimodal alignment.

Abstract

Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research related to touch primarily focuses on the visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.

Taxonomy

Topics: Spatial Cognition and Navigation · Categorization, perception, and language