Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation
Mengdan Zhu, Senhao Cheng, Guangji Bai, Yifei Zhang, Liang Zhao

TL;DR
Cross-modal RAG introduces a sub-dimensional retrieval and generation framework that decomposes queries and images into components, enabling more precise retrieval and synthesis in complex text-to-image tasks.
Contribution
It presents a novel sub-dimensional decomposition approach with hybrid retrieval and subquery-aware generation, improving over existing RAG methods for complex image synthesis.
Findings
Outperforms baselines in retrieval accuracy on multiple datasets.
Enhances generation quality with subquery-aware visual conditioning.
Maintains high efficiency in retrieval and generation processes.
Abstract
Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture, necessitating the integration of retrieval methods. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively…
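The abstract's notion of a Pareto-optimal image set can be illustrated with a minimal sketch. It assumes each candidate image already has one relevance score per subquery, with higher meaning better. The abstract does not state the paper's exact dominance criterion or scoring function, so the standard definition of Pareto dominance is used here, and the names `dominates`, `pareto_front`, and the example scores are hypothetical.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every subquery
    and strictly better on at least one (standard Pareto dominance)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Keep every image whose score vector is not dominated by another image."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other, other_vec in scores.items()
            if other != name
        )
    ]


# Hypothetical per-subquery scores for a query split into three subqueries.
scores = {
    "img_a": [0.9, 0.1, 0.2],   # strongest on subquery 1
    "img_b": [0.2, 0.8, 0.1],   # strongest on subquery 2
    "img_c": [0.1, 0.1, 0.7],   # strongest on subquery 3
    "img_d": [0.1, 0.05, 0.1],  # no better than img_a on any subquery
}

front = pareto_front(scores)
print(front)
```

In this toy case `img_d` is dropped because `img_a` matches or beats it on every subquery, while the other three images survive because each covers a different part of the query.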
Peer Reviews
Decision · Submitted to ICLR 2026
1. Proposes a sub-dimensional decomposition mechanism for cross-modal retrieval-augmented generation that is simple yet effective. 2. Achieves consistent improvements in both retrieval and generation. 3. The hybrid retrieval design (sparse + dense) achieves a good balance between accuracy and efficiency.
1. The paper’s novelty is not sufficiently articulated. The concept of Cross-modal RAG has appeared in prior studies (e.g., VisRet), so the authors should more clearly highlight their unique contribution—particularly the multi-dimensional decomposition mechanism for complex multimodal semantics. Comparative or visualization-based analyses (e.g., subquery–feature alignment) would help strengthen the differentiation. 2. The proposed “Pareto-optimal hybrid retrieval” remains largely heuristic and l
1. The paper presents a clear and novel perspective by modeling retrieval as a multi-objective optimization over subqueries. 2. The hybrid sparse–dense retrieval design is technically sound and efficiently implemented. 3. The paper provides extensive experimental results and clear visualizations to support the proposed idea. 4. The Pareto-optimal formulation and subquery-aware conditioning in generation are conceptually elegant and potentially generalizable.
1. The reported BLIP-2 results on MSCOCO deviate substantially from those in the original BLIP-2 paper, raising concerns about the fairness or correctness of baseline reproduction. 2. Evaluation on retrieval benchmarks (e.g., MSCOCO, Flickr30K) is insufficient to reflect the RAG ability for generation. These datasets primarily test retrieval performance, not retrieval-augmented reasoning or compositional synthesis, which are central to RAG. 3. The design of Stage 1 and Stage 2 is very complicate
1. The dual decomposition of queries and images into sub-dimensions is creative and distinguishes it from prior work. 2. The multi-stage framework is well-explained, and the algorithm is detailed with complexity analysis. 3. The proofs are logically structured and align well with the framework’s design.
1. While the authors mention FineRAG briefly in Related Work, no experimental results are provided to directly contrast the two methods. 2. The approach relies heavily on LLMs or query decomposition. Evaluating robustness to different decomposers would strengthen the work. 3. Minor Errors: (1) Line 249, Redundant use of “by” in the sentence: “dominated by by any other image...” (2) Line 277, The term “cos” appears abruptly without prior definition; it should likely align with the earlier no
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Linear Layer · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · BART · Weight Decay · Multi-Head Attention · Dropout
