Thinking in 360°: Humanoid Visual Search in the Wild
Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li

TL;DR
This paper introduces humanoid visual search in 360-degree environments, proposing a new benchmark and demonstrating significant improvements in open-source model performance for complex real-world scenes.
Contribution
It develops a humanoid agent for active visual search in 360-degree scenes and introduces H* Bench, a challenging new benchmark for in-the-wild visual reasoning.
Findings
Top-tier models achieve only ~30% success in complex scenes.
Post-training boosts open-source model success threefold.
Path search remains inherently difficult due to spatial reasoning demands.
Abstract
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale…
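The head-rotation idea in the abstract can be illustrated with a minimal sketch. It assumes the panorama is stored in equirectangular format (yaw spans the image width, pitch spans the height), which the abstract does not state. The function names `rotate_head` and `gaze_to_pixel`, the degree-based action format, and the image size are hypothetical choices for illustration, not the paper's interface.

```python
def rotate_head(yaw_deg, pitch_deg, d_yaw, d_pitch):
    """Apply a relative head rotation (hypothetical action format).

    Yaw wraps around the full 360 degrees; pitch is clamped to [-90, 90].
    """
    new_yaw = (yaw_deg + d_yaw) % 360.0
    new_pitch = max(-90.0, min(90.0, pitch_deg + d_pitch))
    return new_yaw, new_pitch


def gaze_to_pixel(yaw_deg, pitch_deg, width, height):
    """Map a gaze direction to the pixel at the centre of the view.

    Assumes an equirectangular panorama: yaw 0..360 maps to x = 0..width,
    pitch +90 (up) maps to the top row and -90 (down) to the bottom row.
    """
    x = int(yaw_deg / 360.0 * width) % width
    y = min(height - 1, int((90.0 - pitch_deg) / 180.0 * height))
    return x, y


if __name__ == "__main__":
    # Start near the seam of the panorama and turn right by 20 degrees.
    yaw, pitch = rotate_head(350.0, 0.0, 20.0, 0.0)
    print(yaw, pitch)
    print(gaze_to_pixel(yaw, pitch, width=2048, height=1024))
```

The wrap-around on yaw is what separates this setting from search in a single static image: turning past the image seam continues smoothly into the other side of the scene instead of hitting a boundary.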
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Social Robot Interaction and HRI
