Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation

Jiankun Zhang; Shenglai Zeng; Kai Guo; Xinnan Dai; Hui Liu; Jiliang Tang; Yi Chang

arXiv:2602.13179 · cs.IR · February 16, 2026

TL;DR

This paper introduces V-QPP-Bench, a new benchmark for visual query pre-processing in multimodal retrieval-augmented generation, highlighting the importance of handling imperfect visual inputs for improved retrieval performance.

Contribution

The paper formulates visual query pre-processing as an agentic decision-making task and reports an extensive evaluation of the challenges involved and of potential solutions.

Findings

01. Visual imperfections significantly impair retrieval and MRAG performance.

02. Oracle preprocessing can nearly recover perfect performance.

03. Supervised fine-tuning enables smaller models to outperform larger ones.

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect" (suffering from geometric distortions, quality degradation, or semantic ambiguity), leading to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three…
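The "diagnose, then deploy a tool" loop described in the abstract can be illustrated with a toy sketch. This is not the benchmark's implementation: the tool names, the `toy_policy` heuristic (which stands in for an MLLM), and the convention that an upright image has its brightest corner at the top-left are all assumptions made for illustration. The excerpt above does not list the actual tools or models used.

```python
from typing import Callable, Dict, List

# Toy grayscale image: a list of rows of pixel intensities (assumption).
Image = List[List[int]]


def identity(img: Image) -> Image:
    """Leave the query untouched (the 'no fix needed' action)."""
    return [row[:] for row in img]


def rotate_cw(img: Image) -> Image:
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_ccw(img: Image) -> Image:
    """Rotate 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rotate_180(img: Image) -> Image:
    """Rotate 180 degrees."""
    return [row[::-1] for row in img[::-1]]


# Hypothetical tool registry; a real system would expose richer perceptual tools.
TOOLS: Dict[str, Callable[[Image], Image]] = {
    "none": identity,
    "rotate_cw": rotate_cw,
    "rotate_ccw": rotate_ccw,
    "rotate_180": rotate_180,
}


def toy_policy(img: Image) -> str:
    """Stand-in for the MLLM's diagnosis step.

    Assumes an upright image has its brightest corner at the top-left and
    returns the name of the tool that would move that corner back there.
    """
    h, w = len(img), len(img[0])
    corners = {
        "none": img[0][0],                # marker already at top-left
        "rotate_ccw": img[0][w - 1],      # marker at top-right
        "rotate_180": img[h - 1][w - 1],  # marker at bottom-right
        "rotate_cw": img[h - 1][0],       # marker at bottom-left
    }
    return max(corners, key=corners.get)


def preprocess_query(img: Image, policy: Callable[[Image], str]) -> Image:
    """Diagnose the query image, then apply the chosen tool before retrieval."""
    action = policy(img)
    tool = TOOLS.get(action, identity)  # unknown actions leave the image as-is
    return tool(img)


if __name__ == "__main__":
    upright = [[9, 1, 2], [3, 4, 5]]
    distorted = rotate_cw(upright)  # simulate a geometric distortion
    fixed = preprocess_query(distorted, toy_policy)
    print(toy_policy(distorted), fixed == upright)
```

In this framing, the policy only chooses an action; the tools do the actual image manipulation. Swapping `toy_policy` for a call to a real model would leave `preprocess_query` unchanged, which is one way to compare different decision-makers over the same tool set.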

Taxonomy

Topics: Multimodal Machine Learning Applications · Information Retrieval and Search Behavior · Advanced Image and Video Retrieval Techniques