From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang; Xinyi Sun; Kaituo Feng; Xingping Dong; Dongming Wu; Xiangyu Yue

arXiv:2605.12497 · cs.CV · May 13, 2026

TL;DR

This paper introduces WebEye, a benchmark for knowledge-driven visual perception tasks, and Pixel-Searcher, an agentic system that improves open-world object localization and understanding.

Contribution

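The paper formalizes Perception Deep Research, a perception setting in which a visible target must first be resolved from external knowledge before it can be localized, and proposes the WebEye benchmark and the Pixel-Searcher workflow to address it.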

Findings

01. Pixel-Searcher outperforms existing open-source methods across tasks.

02. Failures mainly occur in evidence acquisition and identity resolution.

03. WebEye provides a comprehensive dataset for knowledge-based visual perception.

Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that…
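As a rough illustration of how a Search-based Grounding prediction could be scored against the box annotations mentioned in the abstract, the sketch below computes intersection-over-union (IoU) between a predicted and an annotated box. The abstract as shown here does not state the benchmark's metric, so IoU, the `(x_min, y_min, x_max, y_max)` box format, and the function name `box_iou` are all assumptions made for this example.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gt: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], gt[0])
    iy_min = max(pred[1], gt[1])
    ix_max = min(pred[2], gt[2])
    iy_max = min(pred[3], gt[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

A prediction would typically be counted as correct when its IoU with the annotated box exceeds a chosen threshold; the threshold itself is also an assumption and is not taken from the paper.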

Code & Models

Datasets

yangbokang81/WebEyes
Dataset · 754 downloads
