EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

Haozhe Shan; Xiancong Ren; Han Dong; Haoyuan Shi; Yingji Zhang; Jiayu Hu; Yi Zhang; Yong Dai; Bin Shen; Lizhen Qu; Zenglin Xu; Xiaozhu Ju

arXiv:2605.17070·cs.CV·May 19, 2026

TL;DR

EPIC-Bench is a fine-grained benchmark that evaluates the visual grounding capabilities of vision-language models in real-world embodied tasks and exposes their current limitations.

Contribution

The paper introduces EPIC-Bench, a new benchmark with 6.6k annotated tuples across 23 tasks, specifically targeting visual perception in embodied environments.

Findings

01. Current VLMs struggle with complex visual-text alignment for physical interactions.

02. Models show bottlenecks in multi-target counting and part-whole understanding.

03. Advanced reasoning models still face significant challenges in embodied visual grounding.

Abstract

While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex…
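The contrast the abstract draws between question-answering formats and mask-based grounding can be illustrated with a small sketch. This is not the paper's evaluation protocol: the `GroundingSample` class, its field names, the example values, and the choice of intersection-over-union (IoU) as a score are all assumptions made for illustration. It shows only how an (Image, Text, Mask) tuple might be represented and how a predicted mask could be compared against the annotated one.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask: 1 = pixel belongs to the target


@dataclass
class GroundingSample:
    """Hypothetical container for one (Image, Text, Mask) tuple."""
    image_path: str  # path to the image (field name is an assumption)
    text: str        # instruction or referring expression
    mask: Mask       # annotated ground-truth mask


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if union == 0 else inter / union


# Illustrative values only; not taken from the benchmark.
sample = GroundingSample(
    image_path="example.jpg",
    text="the handle of the mug",
    mask=[[0, 1, 1],
          [0, 1, 0]],
)
prediction = [[0, 1, 0],
              [0, 1, 1]]
score = mask_iou(prediction, sample.mask)
```

Unlike a multiple-choice answer, a score of this kind depends on where the model places its prediction in the image, so it is hard to inflate through linguistic priors alone.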

Code & Models

Datasets

rxc205/EPIC-Bench
dataset · 971 dl