Scene Exploration by Vision-Language Models

Venkatesh Sripada; Samuel Carter; Frank Guerin; Amir Ghalamzan

arXiv:2409.17641 · cs.RO · June 10, 2025

Open Access

TL;DR

This paper introduces AP-VLM, a framework combining active perception with vision-language models to improve robotic scene exploration and semantic understanding in complex, partially observable environments.

Contribution

The paper presents a novel active perception framework that integrates vision-language models for robotic exploration and semantic querying, enabling adaptive viewpoint selection.

Findings

01

AP-VLM outperforms passive perception methods in object identification.

02

The system effectively guides robots in complex scenes with occlusions.

03

AP-VLM demonstrates adaptability across different robotic platforms.

Abstract

Active perception enables robots to dynamically gather information by adjusting their viewpoints, a crucial capability for interacting with complex, partially observable environments. In this paper, we present AP-VLM, a novel framework that combines active perception with a Vision-Language Model (VLM) to guide robotic exploration and answer semantic queries. Using a 3D virtual grid overlaid on the scene and orientation adjustments, AP-VLM allows a robotic manipulator to intelligently select optimal viewpoints and orientations to resolve challenging tasks, such as identifying objects in occluded or inclined positions. We evaluate our system on two robotic platforms: a 7-DOF Franka Panda and a 6-DOF UR5, across various scenes with differing object configurations. Our results demonstrate that AP-VLM significantly outperforms passive perception methods and baseline models, including Toward…
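
The abstract's idea of a 3D virtual grid with orientation adjustments can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the workspace bounds, the 3 x 3 x 3 resolution, the yaw/pitch parameterization, and the names `make_virtual_grid`, `look_at_angles`, and `next_viewpoint` are all hypothetical, and the `pick_cell` callback merely stands in for whatever query the VLM actually answers.

```python
import itertools
import math


def make_virtual_grid(bounds, shape):
    """Map each grid index (i, j, k) to the centre of its cell, in metres."""
    steps = [(hi - lo) / n for (lo, hi), n in zip(bounds, shape)]
    grid = {}
    for idx in itertools.product(*(range(n) for n in shape)):
        grid[idx] = tuple(
            lo + (i + 0.5) * step
            for (lo, _), i, step in zip(bounds, idx, steps)
        )
    return grid


def look_at_angles(camera, target):
    """Yaw and pitch (radians) that aim a camera at `camera` toward `target`."""
    dx, dy, dz = (t - c for t, c in zip(target, camera))
    yaw = math.atan2(dy, dx)
    pitch = math.atan2(dz, math.hypot(dx, dy))
    return yaw, pitch


def next_viewpoint(grid, target, pick_cell):
    """Ask `pick_cell` (a stand-in for the VLM) for a cell, then aim at the target."""
    idx = pick_cell(sorted(grid))
    position = grid[idx]
    return idx, position, look_at_angles(position, target)


# Hypothetical workspace: a 0.6 m cube split into 3 x 3 x 3 cells.
bounds = [(0.2, 0.8), (-0.3, 0.3), (0.1, 0.7)]
grid = make_virtual_grid(bounds, (3, 3, 3))
target = (0.5, 0.0, 0.0)  # assumed point of interest on the table

# A trivial policy in place of a real VLM call: always take the top-centre cell.
idx, position, (yaw, pitch) = next_viewpoint(grid, target, lambda cells: (1, 1, 2))
```

In this sketch the model only has to name a discrete cell, and the continuous camera pose is derived geometrically from that choice. Whether AP-VLM divides the work this way is not stated in the abstract.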

Taxonomy

Topics: Retinal Imaging and Analysis · Blind Source Separation Techniques · Image Processing Techniques and Applications