E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes

Koya Sakamoto; Taiki Miyanishi; Daichi Azuma; Shuhei Kurita; Shu Morikuni; Naoya Chiba; Motoaki Kawanabe; Yusuke Iwasawa; Yutaka Matsuo

arXiv:2604.17969·cs.CV·April 24, 2026

TL;DR

E3VS-Bench is a new benchmark for evaluating embodied agents' ability to perform viewpoint-dependent visual search in 3D scenes, emphasizing the importance of active perception with 5-DoF control.

Contribution

The paper introduces E3VS-Bench, a high-fidelity 3D scene benchmark with questions requiring active viewpoint control, enabling evaluation of perception and planning in unrestricted 5-DoF environments.

Findings

01. Models perform significantly worse than humans on viewpoint-dependent tasks.

02. Photorealistic rendering with 3D Gaussian Splatting captures fine details for complex questions.

03. Current models show limitations in active perception and viewpoint planning.

Abstract

Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes…
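To make the idea of 5-DoF viewpoint control concrete, the following is a minimal sketch of how such a viewpoint might be represented. The abstract does not say which five degrees of freedom are used, so the choice here (3D position plus yaw and pitch, with no roll), the z-up convention, and the names `Pose5DoF`, `forward`, and `moved` are all illustrative assumptions and not the benchmark's actual interface.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose5DoF:
    """Hypothetical 5-DoF viewpoint: 3D position plus yaw and pitch (no roll)."""
    x: float
    y: float
    z: float      # height; z-up is an assumed convention
    yaw: float    # radians, rotation about the vertical axis
    pitch: float  # radians, positive looks upward

    def forward(self) -> tuple[float, float, float]:
        """Unit viewing direction implied by yaw and pitch."""
        cp = math.cos(self.pitch)
        return (cp * math.cos(self.yaw),
                cp * math.sin(self.yaw),
                math.sin(self.pitch))

    def moved(self, dx=0.0, dy=0.0, dz=0.0, dyaw=0.0, dpitch=0.0) -> "Pose5DoF":
        """Return a new pose after a relative action, clamping pitch to [-90, 90] degrees."""
        pitch = max(-math.pi / 2, min(math.pi / 2, self.pitch + dpitch))
        yaw = (self.yaw + dyaw) % (2 * math.pi)
        return Pose5DoF(self.x + dx, self.y + dy, self.z + dz, yaw, pitch)


# Example: raise the camera by 0.5 m and tilt it down 45 degrees,
# the kind of vertical shift that could reveal the contents of a container.
start = Pose5DoF(0.0, 0.0, 1.2, 0.0, 0.0)
above = start.moved(dz=0.5, dpitch=-math.pi / 4)
```

Under these assumptions, a vertical viewpoint shift such as the one described in the abstract is a change in `z` combined with a negative `pitch`, which a planar agent restricted to ground-level motion and yaw could not express.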
