Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation
Yue Feng, Weicheng Huang, I-Ming Chen

TL;DR
This paper introduces SO-TA, a tri-modal attention mechanism using optimal transport for contact-rich manipulation tasks, improving robustness and interpretability over existing methods.
Contribution
We propose SO-TA, a novel attention backbone that explicitly models contact-rich interactions with structured OT constraints, enhancing multimodal fusion in imitation learning.
Findings
SO-TA achieves 100% success in peg-in-hole assembly, outperforming a cross-attention baseline.
SO-TA maintains high success rates under challenging conditions such as occlusion and distractors.
OT-derived heatmaps offer interpretable insights into modality influence during tasks.
Abstract
Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch…
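The abstract is cut off before it describes how SO-TA is formulated, so the following is only a generic sketch of the underlying idea: swapping row-wise softmax normalization for an entropic optimal-transport plan computed with Sinkhorn iterations. The function names, the uniform marginals, and the `eps` and `n_iters` parameters are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def softmax_attention(scores):
    """Row-wise softmax: every query spreads unit mass over the keys."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def sinkhorn_attention(scores, row_mass, col_mass, eps=1.0, n_iters=200):
    """Entropic-OT attention plan (hypothetical helper, not the paper's code).

    Returns a non-negative matrix whose row sums approximate `row_mass`
    and whose column sums match `col_mass`, favouring high-score pairs.
    """
    # Gibbs kernel: a higher similarity score acts as a lower transport cost.
    kernel = np.exp((scores - scores.max()) / eps)
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)      # rescale rows toward row_mass
        v = col_mass / (kernel.T @ u)    # rescale columns toward col_mass
    return u[:, None] * kernel * v[None, :]


# Toy example: 4 query tokens attending over 6 key tokens (random features).
rng = np.random.default_rng(0)
queries = rng.standard_normal((4, 8))
keys = rng.standard_normal((6, 8))
scores = queries @ keys.T / np.sqrt(8)

# Assumed uniform marginals; a structured prior could be used instead.
row_mass = np.full(4, 1.0 / 4)
col_mass = np.full(6, 1.0 / 6)

soft = softmax_attention(scores)
plan = sinkhorn_attention(scores, row_mass, col_mass)
```

Softmax constrains only the rows, so a few keys can absorb most of the attention. The transport plan also fixes how much total mass each key receives, which is one way a prior on the distribution of attention mass can be imposed.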
