Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation

Mengdan Zhu; Senhao Cheng; Guangji Bai; Yifei Zhang; Liang Zhao

arXiv:2505.21956·cs.CV·September 30, 2025

TL;DR

Cross-modal RAG introduces a sub-dimensional retrieval and generation framework that decomposes queries and images into components, enabling more precise retrieval and synthesis in complex text-to-image tasks.

Contribution

It presents a novel sub-dimensional decomposition approach with hybrid retrieval and subquery-aware generation, improving over existing RAG methods for complex image synthesis.

Findings

01 · Outperforms baselines in retrieval accuracy on multiple datasets.

02 · Enhances generation quality with subquery-aware visual conditioning.

03 · Maintains high efficiency in retrieval and generation processes.

Abstract

Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively…
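
The idea of a Pareto-optimal image set can be illustrated with a small sketch. It assumes each candidate image already has one relevance score per subquery (higher is better). The score values, the function names `dominates` and `pareto_front`, and the brute-force comparison are all illustrative assumptions; the excerpt above does not specify the paper's actual scoring or retrieval algorithm.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector `a` is at least as good as `b` on every
    subquery and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return indices of images that no other image dominates.

    Brute-force O(n^2) comparison; a hypothetical stand-in, not the
    paper's retrieval procedure.
    """
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical per-subquery scores: rows are images, columns are subqueries.
scores = [
    [0.9, 0.1, 0.2],  # image 0: strong on subquery 0
    [0.2, 0.8, 0.3],  # image 1: strong on subquery 1
    [0.1, 0.1, 0.1],  # image 2: weak everywhere, dominated by image 0
    [0.3, 0.7, 0.9],  # image 3: strong on subquery 2
]

front = pareto_front(scores)
print(front)
```

In this toy example images 0, 1, and 3 each cover a different subquery best, so all three survive, while image 2 is dropped because image 0 matches or beats it on every subquery. This mirrors the abstract's point that no single image needs to contain every element of the query.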

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Proposes a sub-dimensional decomposition mechanism for cross-modal retrieval-augmented generation that is simple yet effective. 2. Achieves consistent improvements in both retrieval and generation. 3. The hybrid retrieval design (sparse + dense) achieves a good balance between accuracy and efficiency.

Weaknesses

1. The paper’s novelty is not sufficiently articulated. The concept of Cross-modal RAG has appeared in prior studies (e.g., VisRet), so the authors should more clearly highlight their unique contribution—particularly the multi-dimensional decomposition mechanism for complex multimodal semantics. Comparative or visualization-based analyses (e.g., subquery–feature alignment) would help strengthen the differentiation. 2. The proposed “Pareto-optimal hybrid retrieval” remains largely heuristic and l

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper presents a clear and novel perspective by modeling retrieval as a multi-objective optimization over subqueries. 2. The hybrid sparse–dense retrieval design is technically sound and efficiently implemented. 3. The paper provides extensive experimental results and clear visualizations to support the proposed idea. 4. The Pareto-optimal formulation and subquery-aware conditioning in generation are conceptually elegant and potentially generalizable.

Weaknesses

1. The reported BLIP-2 results on MSCOCO deviate substantially from those in the original BLIP-2 paper, raising concerns about the fairness or correctness of baseline reproduction. 2. Evaluation on retrieval benchmarks (e.g., MSCOCO, Flickr30K) is insufficient to reflect the RAG ability for generation. These datasets primarily test retrieval performance, not retrieval-augmented reasoning or compositional synthesis, which are central to RAG. 3. The design of Stage 1 and Stage 2 is very complicate

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The dual decomposition of queries and images into sub-dimensions is creative and distinguishes it from prior work. 2. The multi-stage framework is well-explained, and the algorithm is detailed with complexity analysis. 3. The proofs are logically structured and align well with the framework’s design.

Weaknesses

1. While the authors mention FineRAG briefly in Related Work, no experimental results are provided to directly contrast the two methods. 2. The approach relies heavily on LLMs or query decomposition. Evaluating robustness to different decomposers would strengthen the work. 3. Minor Errors: (1) Line 249, Redundant use of “by” in the sentence: “dominated by by any other image...” (2) Line 277, The term “cos” appears abruptly without prior definition; it should likely align with the earlier no

Code & Models

Repositories

mengdanzhu/cross-modal-rag
Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Linear Layer · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · BART · Weight Decay · Multi-Head Attention · Dropout