Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

Yue Feng; Weicheng Huang; I-Ming Chen

arXiv:2605.20433·cs.RO·May 21, 2026

TL;DR

This paper introduces SO-TA, a tri-modal attention mechanism that uses optimal transport to fuse vision, force/torque, and proprioceptive signals for contact-rich manipulation tasks, improving robustness and interpretability over existing methods.

Contribution

We propose SO-TA, a novel attention backbone that explicitly models contact-rich interactions with structured optimal-transport (OT) constraints, enhancing multimodal fusion in imitation learning.

Findings

01 SO-TA achieves 100% success in peg-in-hole assembly, outperforming cross-attention.

02 SO-TA maintains high success rates under challenging conditions like occlusion and distractors.

03 OT-derived heatmaps offer interpretable insights into modality influence during tasks.

Abstract

Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch…
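The abstract is cut off before it describes the SO-TA formulation, so the following is only a generic illustration of the idea it names: replacing softmax normalization with an optimal-transport plan. It is a minimal sketch using standard entropic OT (Sinkhorn iterations), not the authors' method. The function names, the uniform marginals, the 4 x 6 score matrix, and the values of `eps` and `n_iters` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax_attention(scores):
    """Row-wise softmax: each query's weights sum to 1, column mass is unconstrained."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def sinkhorn_attention(scores, row_mass, col_mass, eps=1.0, n_iters=1000):
    """Entropic OT plan whose row and column sums match the given marginals.

    Hypothetical helper for illustration; higher score means lower transport cost.
    """
    # Gibbs kernel. Subtracting the global max rescales K but not the final plan.
    kernel = np.exp((scores - scores.max()) / eps)
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)    # rescale rows toward row_mass
        v = col_mass / (kernel.T @ u)  # rescale columns toward col_mass
    return u[:, None] * kernel * v[None, :]


rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6))  # assumed: 4 query tokens x 6 image patches
row_mass = np.full(4, 1.0 / 4)    # assumed uniform mass per query
col_mass = np.full(6, 1.0 / 6)    # assumed uniform prior over patches

soft = softmax_attention(scores)
plan = sinkhorn_attention(scores, row_mass, col_mass)

print("softmax column sums:", soft.sum(axis=0))  # free to concentrate on few patches
print("OT plan column sums:", plan.sum(axis=0))  # pinned to col_mass
```

The contrast is in the column sums: softmax constrains only each row, so attention can collapse onto a few patches, whereas the transport plan must also respect a prescribed mass per patch. A non-uniform `col_mass` is one way a prior over task-relevant regions could be expressed in this generic setup.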
