From Web to Pixels: Bringing Agentic Search into Visual Perception
Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

TL;DR
This paper introduces WebEye, a benchmark for knowledge-driven visual perception tasks, and Pixel-Searcher, an agentic system that improves open-world object localization and understanding.
Contribution
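It formalizes Perception Deep Research, a perception challenge in which a visible target must first be resolved from external knowledge before it can be localized, and proposes the WebEye benchmark and the Pixel-Searcher method to address it.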
Findings
Pixel-Searcher outperforms existing open-source methods across tasks.
Failures mainly occur in evidence acquisition and identity resolution.
WebEye provides a comprehensive dataset for knowledge-based visual perception.
Abstract
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that…
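The abstract mentions precise box annotations for the Search-based Grounding view but does not say here how predictions are scored. As an illustration only, the sketch below computes intersection-over-union (IoU), a commonly used overlap measure for box localization. The name `box_iou` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for this example, not details taken from the paper.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union between a predicted and an annotated box.

    Hypothetical helper for illustration; the paper's actual metric is
    not stated in this summary.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], gt[0])
    iy_min = max(pred[1], gt[1])
    ix_max = min(pred[2], gt[2])
    iy_max = min(pred[3], gt[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter

    # Degenerate (zero-area) inputs are scored as no overlap.
    return inter / union if union > 0 else 0.0
```

In a setting like the one described, a low IoU could come either from localizing the wrong object after a failed search step or from an imprecise box around the right object, which is why identity resolution and localization are worth inspecting separately.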
