Active Stereo-Camera Outperforms Multi-Sensor Setup in ACT Imitation Learning for Humanoid Manipulation
Robin Kühn, Moritz Schappler, Thomas Seel, Dennis Bank

TL;DR
This study shows that an active stereo-camera setup can outperform complex multi-sensor configurations in imitation learning for humanoid manipulation, especially in data-limited scenarios.
Contribution
It benchmarks sensor combinations on a humanoid robot, introduces an open-source ablation framework, and demonstrates the effectiveness of active vision over additional modalities.
Findings
The active stereo-camera setup achieved an 87.5% success rate in spatial generalization.
Adding pressure sensors reduced the success rate due to a low signal-to-noise ratio (SNR).
Strategic sensor selection can outperform complex configurations in data-limited regimes.
Abstract
The complexity of teaching humanoid robots new tasks is one of the main obstacles to their widespread adoption in industry. While Imitation Learning (IL), particularly Action Chunking with Transformers (ACT), enables rapid task acquisition, there is no consensus yet on the optimal sensory hardware for manipulation tasks. This paper benchmarks 14 sensor combinations on the Unitree G1 humanoid robot, equipped with three-finger hands, on two manipulation tasks. We explicitly evaluate the integration of tactile and proprioceptive modalities alongside active vision. Our analysis demonstrates that strategic sensor selection can outperform complex configurations in data-limited regimes while reducing computational overhead. We develop an open-source Unified Ablation Framework that utilizes sensor masking on a comprehensive master dataset. Results indicate that additional…
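The sensor-masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the modality names, the zero-fill strategy, and the helpers `mask_observation` and `sensor_subsets` are all assumptions made for illustration. The sketch only shows how a single master recording with every sensor present could be reused for many ablations by blanking the modalities that a given configuration excludes.

```python
from itertools import combinations


def mask_observation(obs, active, fill=0.0):
    """Return a copy of `obs` with every modality outside `active` blanked.

    `obs` maps a modality name to a flat list of numbers. Zero-filling is an
    assumption; a real framework might drop the channels instead.
    """
    unknown = set(active) - set(obs)
    if unknown:
        raise KeyError(f"unknown modalities: {sorted(unknown)}")
    masked = {}
    for name, values in obs.items():
        if name in active:
            masked[name] = list(values)  # keep the data, copied
        else:
            masked[name] = [fill] * len(values)  # same shape, no signal
    return masked


def sensor_subsets(modalities, required=()):
    """Yield every non-empty subset of `modalities` containing `required`."""
    optional = [m for m in modalities if m not in required]
    for k in range(len(optional) + 1):
        for combo in combinations(optional, k):
            subset = tuple(required) + combo
            if subset:
                yield subset


if __name__ == "__main__":
    # Hypothetical frame from a master dataset (names and values are made up).
    frame = {
        "stereo_left": [0.2, 0.4],
        "stereo_right": [0.1, 0.3],
        "pressure": [0.05],
        "joint_pos": [0.0, 1.0, 2.0],
    }
    for subset in sensor_subsets(tuple(frame), required=("joint_pos",)):
        print(subset, mask_observation(frame, set(subset)))
```

Keeping the masked tensors at their original shape means one policy architecture can be trained on every configuration without changing its input layer, which is one plausible reason to mask rather than re-record.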
