Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities

Yiyun Zhou; Mingjing Xu; Jingwei Shi; Quanjiang Li; Jingyuan Chen

arXiv:2511.11512 · cs.RO · February 3, 2026

TL;DR

This paper introduces TLV-CoRe, a novel collaborative learning framework that enhances the integration and generalization of tactile, language, and vision data in robotics, addressing sensor variability and improving cross-modal alignment.

Contribution

The paper presents a new CLIP-based method with a Sensor-Aware Modulator, tactile-irrelevant decoupling, and a Unified Bridging Adapter for improved multimodal tactile representation.
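Because the method is described as CLIP-based, a generic CLIP-style symmetric contrastive loss applied to each pair of modalities may help make "alignment" concrete. The following is a minimal sketch under stated assumptions, not the paper's actual objective: the function names, the temperature of 0.07, and the equal weighting of the three pairs are all hypothetical, and the Sensor-Aware Modulator, tactile-irrelevant decoupling, and Unified Bridging Adapter are not modeled.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_pair_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    Row i of `a` and row i of `b` are treated as the matching pair;
    every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)


def tri_modal_loss(tactile, language, vision, temperature=0.07):
    """Average the pairwise losses over the three modality pairs.

    Equal weighting is an assumption made for illustration only.
    """
    return (
        clip_pair_loss(tactile, language, temperature)
        + clip_pair_loss(tactile, vision, temperature)
        + clip_pair_loss(vision, language, temperature)
    ) / 3.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))            # stand-in embeddings, not real data
    print("aligned:", tri_modal_loss(emb, emb, emb))
    print("misaligned:", tri_modal_loss(emb, emb[::-1], emb))
```

In this toy setup, identical embeddings across the three modalities give a loss near zero, while reversing the order of one modality breaks the pairing and raises the loss.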

Findings

01. TLV-CoRe improves sensor-agnostic tactile feature learning.

02. The method enhances alignment across the tactile, language, and vision modalities.

03. Experimental results validate the effectiveness of the proposed approach.

Abstract

Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we…

Taxonomy

Topics: Advanced Sensor and Energy Harvesting Materials · Robot Manipulation and Learning · Tactile and Sensory Interactions