VISTA: Open-Vocabulary, Task-Relevant Robot Exploration with Online Semantic Gaussian Splatting
Keiko Nagami, Timothy Chen, Javier Yu, Ola Shorinwa, Maximilian Adang, Carlyn Dougherty, Eric Cristofalo, Mac Schwager

TL;DR
VISTA is an active exploration method enabling robots to efficiently search for task-relevant objects using semantic-aware planning and real-time 3D scene reconstruction, outperforming existing methods in speed and success rate.
Contribution
VISTA introduces a novel viewpoint-semantic coverage metric and a receding-horizon planning approach for open-vocabulary, task-focused exploration with real-time 3D Gaussian Splatting reconstruction.
Findings
Outperforms state-of-the-art baselines in coverage speed and reconstruction quality.
Achieves 6x higher success rates in challenging environments.
Demonstrates platform-agnostic deployment on drone and quadruped robots.
Abstract
We present VISTA (Viewpoint-based Image selection with Semantic Task Awareness), an active exploration method for robots to plan informative trajectories that improve 3D map quality in areas most relevant for task completion. Given an open-vocabulary search instruction (e.g., "find a person"), VISTA enables a robot to explore its environment to search for the object of interest, while simultaneously building a real-time semantic 3D Gaussian Splatting reconstruction of the scene. The robot navigates its environment by planning receding-horizon trajectories that prioritize semantic similarity to the query and exploration of unseen regions of the environment. To evaluate trajectories, VISTA introduces a novel, efficient viewpoint-semantic coverage metric that quantifies both the geometric view diversity and task relevance in the 3D scene. On static datasets, our coverage metric outperforms…
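The idea of scoring candidate trajectories by both view diversity and task relevance can be illustrated with a small toy sketch. This is not the paper's metric: it is a minimal 2D approximation under stated assumptions, where the scene is a list of points, each point carries a hypothetical relevance weight in [0, 1] standing in for semantic similarity to the query, and view diversity is approximated by counting each (point, viewing-direction bin) pair at most once. The names `direction_bin`, `score_trajectory`, `pick_best`, `max_range`, and `n_bins` are all illustrative, and occlusion is ignored.

```python
import math


def direction_bin(dx, dy, n_bins=8):
    """Map a 2D viewing direction to one of n_bins angular bins (assumed discretization)."""
    angle = math.atan2(dy, dx)  # in [-pi, pi]
    return int((angle + math.pi) / (2 * math.pi) * n_bins) % n_bins


def score_trajectory(waypoints, points, relevance, seen, max_range=5.0, n_bins=8):
    """Toy coverage gain of a candidate trajectory.

    waypoints: list of (x, y) camera positions along the trajectory.
    points:    list of (x, y) scene points.
    relevance: per-point weight in [0, 1] (hypothetical stand-in for query similarity).
    seen:      set of (point_index, direction_bin) pairs already observed.
    """
    seen = set(seen)  # work on a copy so candidates are scored independently
    total = 0.0
    for cx, cy in waypoints:
        for i, (px, py) in enumerate(points):
            dx, dy = px - cx, py - cy
            dist = math.hypot(dx, dy)
            if dist == 0.0 or dist > max_range:
                continue  # point is not observable from this waypoint
            key = (i, direction_bin(dx, dy, n_bins))
            if key not in seen:
                # Reward only new viewing directions, weighted by task relevance.
                total += relevance[i]
                seen.add(key)
    return total


def pick_best(trajectories, points, relevance, seen, **kwargs):
    """Return the index of the highest-scoring candidate and all scores."""
    scores = [score_trajectory(t, points, relevance, seen, **kwargs) for t in trajectories]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    points = [(2.0, 0.0), (-2.0, 0.0)]
    relevance = [0.9, 0.1]          # the first point matches the query better
    candidates = [[(1.0, 0.0)], [(-1.0, 0.0)]]
    print(pick_best(candidates, points, relevance, seen=set(), max_range=1.5))
```

In a receding-horizon setting, a planner along these lines would execute only the first part of the selected trajectory, add the newly observed pairs to `seen`, and then re-score a fresh set of candidates.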
Taxonomy
Topics: Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
