ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Fotios Lygerakis; Ozan Özdenizci; Elmar Rückert

arXiv:2505.20032 · cs.CV · April 30, 2026

TL;DR

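ViTaPEs introduces a two-stage positional encoding scheme for multimodal transformers: local, modality-specific encodings are added within each stream, and a global encoding is added on the joint token sequence. The scheme targets visuotactile representation learning, aiming at better cross-modal alignment, generalization, and transfer learning in robotics.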

Contribution

The paper proposes a new two-stage positional injection approach in transformer architectures for visuotactile data, improving cross-modal fusion and zero-shot generalization.

Findings

01. Outperforms state-of-the-art baselines on recognition tasks
02. Demonstrates strong zero-shot generalization to unseen scenarios
03. Excels in robotic grasping success prediction

Abstract

Tactile sensing provides essential local information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and in generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, and so overlook the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before…
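The two-stage positional injection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`make_pe`, `add_pe`, `two_stage_injection`), the randomly initialised encoding tables (stand-ins for whatever encodings the paper actually uses), the token counts, and the embedding size are all assumptions made for illustration. The sketch stops at the joint token sequence, since the abstract is truncated at that point.

```python
import random


def make_pe(num_tokens, dim, seed):
    """Hypothetical encoding table of shape (num_tokens, dim).

    Random values stand in for the real encodings, whose form the
    abstract does not specify.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]


def add_pe(tokens, pe):
    """Element-wise sum of a token sequence and an equally shaped table."""
    if len(tokens) != len(pe):
        raise ValueError("token sequence and encoding table differ in length")
    return [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, pe)]


def two_stage_injection(vision_tokens, tactile_tokens, pe_vision, pe_tactile, pe_global):
    # Stage 1: local, modality-specific encodings within each stream.
    vision = add_pe(vision_tokens, pe_vision)
    tactile = add_pe(tactile_tokens, pe_tactile)
    # Stage 2: concatenate along the token axis, then add a global
    # encoding on the joint sequence.
    joint = vision + tactile
    return add_pe(joint, pe_global)


# Assumed toy sizes: 4 vision tokens, 3 tactile tokens, 8-dim embeddings.
dim, n_vis, n_tac = 8, 4, 3
vision_tokens = [[0.0] * dim for _ in range(n_vis)]
tactile_tokens = [[0.0] * dim for _ in range(n_tac)]
pe_v = make_pe(n_vis, dim, seed=0)
pe_t = make_pe(n_tac, dim, seed=1)
pe_g = make_pe(n_vis + n_tac, dim, seed=2)

joint = two_stage_injection(vision_tokens, tactile_tokens, pe_v, pe_t, pe_g)
print(len(joint), len(joint[0]))  # 7 8
```

In this sketch each token carries two positional signals: one indexing its position inside its own modality, and one indexing its position in the fused sequence. Keeping the three tables as separate arguments makes it easy to ablate either stage by passing a table of zeros.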
