EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models
Haozhe Shan, Xiancong Ren, Han Dong, Haoyuan Shi, Yingji Zhang, Jiayu Hu, Yi Zhang, Yong Dai, Bin Shen, Lizhen Qu, Zenglin Xu, Xiaozhu Ju

TL;DR
EPIC-Bench is a fine-grained benchmark for evaluating the visual grounding capabilities of vision-language models in real-world embodied tasks. Its results expose limitations of current models, notably in multi-target counting and part-whole understanding.
Contribution
The paper introduces EPIC-Bench, a new benchmark of 6.6k annotated (Image, Text, Mask) tuples spanning 23 fine-grained tasks, specifically targeting visual perception in embodied environments.
Findings
Current VLMs struggle with complex visual-text alignment for physical interactions.
Models show bottlenecks in multi-target counting and part-whole understanding.
Advanced reasoning models still face significant challenges in embodied visual grounding.
Abstract
While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex…
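As a rough illustration of what a mask-based grounding benchmark involves, the sketch below models one (Image, Text, Mask) tuple and scores a predicted mask against the annotated one. The summary above does not say which metric EPIC-Bench uses, so intersection-over-union (IoU) is an assumption here, and the names `GroundingSample`, `image_path`, and `mask_iou` are hypothetical rather than taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List

# Binary mask as rows of 0/1 values; this encoding is an assumption.
Mask = List[List[int]]


@dataclass
class GroundingSample:
    """One (Image, Text, Mask) tuple; field names are hypothetical."""
    image_path: str
    text: str
    mask: Mask


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape.

    IoU is an assumed metric, not one confirmed by the source.
    """
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            p, g = bool(p), bool(g)
            inter += p and g
            union += p or g
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


sample = GroundingSample(
    image_path="example.jpg",          # placeholder path
    text="the handle of the mug",      # placeholder referring expression
    mask=[[1, 1], [0, 0]],
)
prediction = [[1, 0], [0, 0]]
score = mask_iou(prediction, sample.mask)
```

Scoring against a mask rather than a multiple-choice answer means a model cannot earn credit from linguistic priors alone: the prediction has to overlap the annotated pixels.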
