Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval
Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shibiao Xu

TL;DR
This paper introduces a hierarchical multi-agent framework with self-evolution and test-time scaling for zero-shot compositional image retrieval (ZS-CIR), and reports state-of-the-art results on CIRR, CIRCO, and FashionIQ.
Contribution
To the best of the authors' knowledge, it is the first work to integrate experience self-evolution and test-time scaling (TTS) into ZS-CIR, enabling dynamic perception and fine-grained reasoning.
Findings
Reports state-of-the-art (SOTA) results on the CIRR, CIRCO, and FashionIQ datasets.
Introduces a hierarchical multi-agent architecture with dynamic perception dispatch.
Demonstrates the effectiveness of self-evolution and TTS in zero-shot retrieval.
Abstract
Zero-Shot Compositional Image Retrieval (ZS-CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitutes the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Law (TTS) into ZS-CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate…
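The dispatch step described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the keyword-based router, the worker names, the `global` fallback, and the idea of merging worker outputs into a deduplicated candidate list are all assumptions made for illustration, since the abstract is truncated before it describes what the candidates are gathered into or how the workers operate.

```python
from typing import Callable, Dict, List

# Hypothetical worker type: maps a modification text to a ranked list of image ids.
Worker = Callable[[str], List[str]]


def route_intents(modification_text: str, keyword_map: Dict[str, str]) -> List[str]:
    """Toy intent router (assumption): pick workers whose keyword occurs in the text."""
    text = modification_text.lower()
    selected = [name for keyword, name in keyword_map.items() if keyword in text]
    # Remove duplicate worker names while preserving first-seen order.
    return list(dict.fromkeys(selected))


def build_candidate_pool(
    modification_text: str,
    workers: Dict[str, Worker],
    keyword_map: Dict[str, str],
    default_worker: str = "global",
) -> List[str]:
    """Dispatch to the routed workers and merge their results without duplicates."""
    names = route_intents(modification_text, keyword_map) or [default_worker]
    pool: List[str] = []
    seen = set()
    for name in names:
        for image_id in workers[name](modification_text):
            if image_id not in seen:
                seen.add(image_id)
                pool.append(image_id)
    return pool


# Illustrative stand-ins only; a real system would query retrieval models here.
example_keyword_map = {"color": "attribute", "red": "attribute", "background": "scene"}
example_workers: Dict[str, Worker] = {
    "attribute": lambda query: ["img3", "img1"],
    "scene": lambda query: ["img1", "img7"],
    "global": lambda query: ["img9"],
}

if __name__ == "__main__":
    print(build_candidate_pool(
        "make the shirt red with a beach background",
        example_workers,
        example_keyword_map,
    ))
```

Merging the outputs of several views favors recall over precision, which matches the abstract's emphasis on a high-recall candidate stage before deliberation; how PDF actually infers intents and weights workers is not specified in the text above.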
