StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang, Jianwen Xie, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

TL;DR
StereoPolicy improves robotic manipulation by using stereo image pairs to strengthen geometric reasoning without explicit 3D reconstruction, outperforming RGB, RGB-D, point cloud, and multi-view baselines in simulation and demonstrating effective manipulation on real robots.
Contribution
Introduces StereoPolicy, a visuomotor policy learning framework that fuses synchronized stereo image pairs through a Stereo Transformer, without requiring explicit 3D reconstruction or camera calibration.
Findings
StereoPolicy outperforms RGB, RGB-D, point cloud, and multi-view baselines in simulation benchmarks.
StereoPolicy demonstrates effective real-robot manipulation in tabletop and bimanual settings.
Using stereo vision improves geometric reasoning and manipulation accuracy.
Abstract
Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with…
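The encode-then-fuse design described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the summary does not specify the Stereo Transformer's internals, so the sketch assumes a single joint self-attention layer over the concatenated left and right tokens. The names `encode_patches` and `stereo_fuse` are hypothetical, and a fixed random patch projection stands in for the pretrained 2D encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encode_patches(image, patch=8, dim=32, seed=0):
    """Hypothetical stand-in for a pretrained 2D encoder.

    Splits an (H, W, C) image into non-overlapping patches and applies a
    fixed random linear projection. The same seed gives the same weights,
    so both views share one encoder but are processed independently.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patches.shape[1], dim)) / np.sqrt(patches.shape[1])
    return patches @ proj  # (num_tokens, dim)


def stereo_fuse(left_tokens, right_tokens, seed=1):
    """Assumed fusion step: one joint self-attention layer over both views.

    Every token can attend to tokens from the other view, which is one way
    correspondence cues could be captured implicitly. Weights are random
    here; in a real policy they would be learned.
    """
    d = left_tokens.shape[1]
    rng = np.random.default_rng(seed)
    view_embed = rng.standard_normal((2, d)) * 0.02  # marks left vs. right
    tokens = np.concatenate(
        [left_tokens + view_embed[0], right_tokens + view_embed[1]], axis=0
    )
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d))  # (2N, 2N) attention weights
    fused = tokens + attn @ v             # residual connection
    return fused.mean(axis=0), attn       # pooled feature for a policy head


rng = np.random.default_rng(42)
left = rng.random((32, 32, 3))
right = np.roll(left, shift=2, axis=1)  # crude horizontal shift as fake disparity
feat, attn = stereo_fuse(encode_patches(left), encode_patches(right))
print(feat.shape, attn.shape)
```

In this sketch the pooled vector `feat` plays the role of the visual feature that a downstream policy head would consume. No calibration parameters or depth maps appear anywhere in the pipeline, which is the property the abstract emphasizes.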
