MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation
Xianbo Cai, Hideyuki Ichiwara, Masaki Yoshikawa, Tetsuya Ogata

TL;DR
This paper presents MSACT, a multistage spatial attention approach that improves stability while keeping inference low-latency in bimanual fine manipulation tasks. It aligns predicted attention sequences with visual features to reduce drift.
Contribution
It introduces a novel multistage spatial attention module with a temporal alignment loss, improving visual localization stability without requiring keypoint annotations.
Findings
Improved localization stability in simulated and real-world tasks.
Enhanced task success rates with low-latency inference.
Reduced attention drift under visual disturbances.
Abstract
Real-world fine manipulation, particularly bimanual manipulation, typically requires low-latency control and stable visual localization. Collecting large-scale data is costly, however, and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; and vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT…
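To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a 2D attention point is read out of a score map with a spatial soft-argmax, and that the temporal alignment loss is a mean squared distance between predicted future points and points extracted from the corresponding later frames. Both choices, along with the names `soft_argmax_2d`, `temporal_alignment_loss`, and `temperature`, are illustrative assumptions; the summary above does not specify them.

```python
import math


def soft_argmax_2d(score_map, temperature=1.0):
    """Expected (x, y) location of a 2D score map, in [-1, 1] coordinates.

    Assumption: a spatial soft-argmax readout; the paper may use a
    different mechanism to obtain its attention points.
    """
    h, w = len(score_map), len(score_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in score_map)
    weights = [[math.exp((v - peak) / temperature) for v in row]
               for row in score_map]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for i, row in enumerate(weights):
        y_coord = -1.0 + 2.0 * i / (h - 1) if h > 1 else 0.0
        for j, weight in enumerate(row):
            x_coord = -1.0 + 2.0 * j / (w - 1) if w > 1 else 0.0
            prob = weight / total
            x += prob * x_coord
            y += prob * y_coord
    return (x, y)


def temporal_alignment_loss(predicted_points, extracted_points):
    """Mean squared distance between two equal-length sequences of 2D points.

    Assumption: a plain squared-error form, used here only to illustrate
    aligning predicted attention sequences with later observations.
    """
    if len(predicted_points) != len(extracted_points):
        raise ValueError("sequences must have the same length")
    squared = [(px - ex) ** 2 + (py - ey) ** 2
               for (px, py), (ex, ey) in zip(predicted_points, extracted_points)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # A score map with a sharp peak in the top-left corner.
    corner = [[50.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(soft_argmax_2d(corner))
    print(temporal_alignment_loss([(0.0, 0.0)], [(3.0, 4.0)]))
```

In this sketch a lower `temperature` sharpens the readout toward the highest-scoring pixel, while a higher one averages over a wider region.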
