TL;DR
ViTaPEs introduces a novel two-stage positional encoding method for multimodal transformers that enhances visuotactile representation learning, enabling better cross-modal alignment, generalization, and transfer learning in robotics.
Contribution
The paper proposes a new two-stage positional injection approach in transformer architectures for visuotactile data, improving cross-modal fusion and zero-shot generalization.
Findings
Outperforms state-of-the-art baselines on recognition tasks
Demonstrates strong zero-shot generalization to unseen scenarios
Excels in robotic grasping success prediction
Abstract
Tactile sensing provides essential local information, such as texture, compliance, and force, that complements visual perception. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before…
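The two-stage positional injection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes fixed sinusoidal encodings (the abstract does not say whether the encodings are learned or fixed), plain Python lists in place of tensors, and hypothetical names (`sinusoidal_pe`, `add_pe`, `two_stage_injection`, and the per-modality `base` values). The sketch stops at the joint token sequence, since the abstract is truncated at that point.

```python
import math


def sinusoidal_pe(num_tokens, dim, base=10000.0):
    """Return fixed sinusoidal encodings, one row of length `dim` per position.

    Assumption: fixed sinusoids stand in for whatever encodings the paper uses.
    """
    pe = []
    for pos in range(num_tokens):
        row = []
        for i in range(dim):
            angle = pos / (base ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def add_pe(tokens, pe):
    """Element-wise sum of token embeddings and positional encodings."""
    return [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, pe)]


def two_stage_injection(vision_tokens, tactile_tokens):
    """Hypothetical helper: local PEs per stream, then a global PE on the joint sequence."""
    dim = len(vision_tokens[0])

    # Stage 1: local, modality-specific encodings added within each stream.
    # The different `base` values are placeholders that make the two tables differ.
    vision_local = add_pe(
        vision_tokens, sinusoidal_pe(len(vision_tokens), dim, base=10000.0)
    )
    tactile_local = add_pe(
        tactile_tokens, sinusoidal_pe(len(tactile_tokens), dim, base=5000.0)
    )

    # Stage 2: concatenate the streams and add one global encoding
    # indexed over the joint token sequence.
    joint = vision_local + tactile_local
    return add_pe(joint, sinusoidal_pe(len(joint), dim))


# Toy inputs: 3 vision tokens and 2 tactile tokens, embedding size 4.
vision = [[0.0] * 4 for _ in range(3)]
tactile = [[0.0] * 4 for _ in range(2)]
joint = two_stage_injection(vision, tactile)
print(len(joint), len(joint[0]))  # 5 4
```

In this sketch every token receives two positional signals: one indexing its place within its own modality, and one indexing its place in the fused sequence. That is the sense in which the injection is two-stage.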
