MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

Xianbo Cai; Hideyuki Ichiwara; Masaki Yoshikawa; Tetsuya Ogata

arXiv:2605.00475·cs.RO·May 4, 2026

TL;DR

This paper presents MSACT, a multistage spatial attention approach for bimanual fine manipulation that improves visual localization stability while keeping inference latency low. It reduces drift by aligning predicted attention sequences with visual features.

Contribution

It introduces a novel multistage spatial attention module with a temporal alignment loss, improving visual localization stability without requiring keypoint annotations.

Findings

01. Improved localization stability in simulated and real-world tasks.

02. Enhanced task success rates with low-latency inference.

03. Reduced attention drift under visual disturbances.

Abstract

Real-world fine manipulation, particularly bimanual manipulation, typically requires low-latency control and stable visual localization. Collecting large-scale data is costly, however, and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; and vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT…
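
To make the idea of 2D attention points and a temporal alignment loss concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the excerpt above does not specify the module or the loss, so the sketch assumes a standard spatial softmax for point extraction and a mean squared error between predicted future points and points extracted from the corresponding future frames. The function names, the `temperature` parameter, and the tensor shapes are all hypothetical.

```python
import numpy as np


def spatial_softmax_points(features, temperature=1.0):
    """Turn feature maps of shape (C, H, W) into C expected 2D points.

    ASSUMPTION: a plain spatial softmax stands in for the paper's
    multistage attention module, which the abstract does not detail.
    Returned coordinates are (x, y) pairs normalized to [-1, 1].
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(c, h, w)

    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    # Marginalize over rows for x and over columns for y, then take
    # the expectation of the coordinate grid under each marginal.
    x = (probs.sum(axis=1) * xs).sum(axis=1)
    y = (probs.sum(axis=2) * ys).sum(axis=1)
    return np.stack([x, y], axis=1)  # shape (C, 2)


def temporal_alignment_loss(predicted_points, future_features, temperature=1.0):
    """Mean squared error between predicted and extracted future points.

    ASSUMPTION: the loss form is illustrative only.
    predicted_points: (T, C, 2) points predicted for T future steps.
    future_features:  (T, C, H, W) feature maps of those future frames.
    """
    targets = np.stack(
        [spatial_softmax_points(f, temperature) for f in future_features]
    )
    return float(np.mean((predicted_points - targets) ** 2))
```

In this sketch, a feature map with a single sharp peak in its top-right corner yields a point close to (1, -1), and the loss is zero when the predicted sequence matches the points extracted from the future frames.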
