Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances

Xianbo Cai; Hideyuki Ichiwara; Hyogo Hiruma; Masaki Yoshikawa; Hiroshi Ito; Tetsuya Ogata

arXiv:2605.00471·cs.RO·May 4, 2026

TL;DR

This paper introduces a stereo multistage spatial attention-based deep predictive learning method for real-time mobile manipulation, designed to cope with visual scale variation and visual disturbances in unstructured environments.

Contribution

The method extracts task-relevant spatial attention points from stereo images and integrates them with robot states through a hierarchical recurrent architecture for closed-loop action prediction.

Findings

01. Enhanced robustness and success rates under visual disturbances.

02. Effective handling of scale variations in real-world tasks.

03. Outperforms baseline imitation learning and vision-language models.

Abstract

Robots operating in open, unstructured real-world environments must rely on onboard visual perception while autonomously moving across different locations. Continuous changes in onboard camera viewpoints cause significant visual scale variations in target objects, affecting vision-based motion generation. In this work, we present a stereo multistage spatial attention-based deep predictive learning method for real-time mobile manipulation. The proposed method extracts task-relevant spatial attention points from stereo images and integrates them with robot states through a hierarchical recurrent architecture for closed-loop action prediction. We evaluate the system on four real-world mobile manipulation tasks using a mobile manipulator, including rigid placement, articulated object manipulation, and deformable object interaction. Experiments under randomized initial positions and visual…
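
To make the idea of "spatial attention points" concrete, here is a minimal sketch in plain Python. It assumes a spatial softmax (soft-argmax) over 2D feature maps, which is one common way to turn a feature map into an image coordinate; the abstract does not say that this is the mechanism the paper uses. The function names `spatial_softmax_point` and `stereo_attention_features`, the `temperature` parameter, and the simple concatenation with the robot state are all hypothetical and chosen for illustration only. The multistage and recurrent parts of the method are not modeled here.

```python
import math


def spatial_softmax_point(feature_map, temperature=1.0):
    """Return the soft-argmax (x, y) of a 2D feature map in [-1, 1] coordinates.

    Assumption: attention points come from a spatial softmax; the paper's
    actual extraction mechanism is not specified in the abstract.
    The map must have at least 2 rows and 2 columns.
    """
    h = len(feature_map)
    w = len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [[math.exp((v - peak) / temperature) for v in row]
               for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = 0.0
    y = 0.0
    for i, row in enumerate(weights):
        for j, wgt in enumerate(row):
            p = wgt / total
            # Map pixel indices to normalized coordinates in [-1, 1].
            x += p * (2.0 * j / (w - 1) - 1.0)
            y += p * (2.0 * i / (h - 1) - 1.0)
    return x, y


def stereo_attention_features(left_maps, right_maps, robot_state,
                              temperature=1.0):
    """Concatenate attention points from both cameras with the robot state.

    Hypothetical helper: the flat vector it returns stands in for the input
    that a recurrent policy would consume at each control step.
    """
    features = []
    for fmap in list(left_maps) + list(right_maps):
        features.extend(spatial_softmax_point(fmap, temperature))
    return features + list(robot_state)
```

A lower `temperature` makes each point snap toward the strongest activation, while a higher one averages over a wider region. Because the coordinates are normalized rather than given in pixels, the same representation applies regardless of image resolution.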
