ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried

TL;DR
ImageRAG introduces a flexible retrieval-augmented approach that dynamically fetches relevant images to guide diffusion-based image generation, significantly improving the synthesis of rare and fine-grained concepts without retraining models.
Contribution
It presents a novel, adaptable method that enhances image generation by leveraging existing models and dynamic retrieval, avoiding the need for RAG-specific training.
Findings
Improves generation of rare and unseen concepts.
Applicable across various diffusion models.
No additional training required for retrieval integration.
Abstract
Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the use of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at:…
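The retrieve-then-regenerate loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `find_missing_concepts`, and `embed_text` are hypothetical stand-ins for an image-conditioned diffusion model, a VLM alignment check, and a text encoder, and ranking by cosine similarity over precomputed image embeddings is an assumed retrieval rule.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_references(
    query_embedding: Vector,
    image_index: Dict[str, Vector],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k (image name, score) pairs, best match first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def generate_with_retrieval(
    prompt: str,
    generate: Callable[[str, List[str]], object],
    find_missing_concepts: Callable[[str, object], List[str]],
    embed_text: Callable[[str], Vector],
    image_index: Dict[str, Vector],
    top_k: int = 1,
) -> object:
    """Generate once, and regenerate with retrieved references only if needed.

    All three callables are hypothetical placeholders (assumptions):
      generate(prompt, reference_names) -> image
      find_missing_concepts(prompt, image) -> concepts the image fails to show
      embed_text(concept) -> vector in the same space as image_index values
    """
    draft = generate(prompt, [])  # first attempt without any references
    missing = find_missing_concepts(prompt, draft)
    if not missing:
        return draft  # the draft already matches the prompt

    references: List[str] = []
    for concept in missing:
        query = embed_text(concept)
        for name, _score in retrieve_references(query, image_index, top_k):
            if name not in references:
                references.append(name)
    # second attempt, conditioned on the retrieved reference images
    return generate(prompt, references)
```

In this sketch the retrieval step always returns the nearest images, even when the index contains nothing relevant; a real system might add a minimum-similarity cutoff, which is the concern the reviewers raise about retrieval dataset coverage.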
Peer Reviews
Decision: ICLR 2026 Poster
The method's adaptability is a key strength, as it seamlessly integrates with existing image-conditioning models like IP-Adapter and OminiControl, enabling broad applicability across different architectures. Extensive experiments, including quantitative metrics, qualitative examples, and human studies, robustly validate that ImageRAG consistently improves rare concept generation, with users preferring it over baselines in text alignment and visual quality.
1. The method's efficacy depends heavily on the retrieval dataset; if it lacks relevant images (e.g., a bird-focused dataset when generating dogs), performance may not improve, as illustrated in the paper's retrieval data limitations. 2. It relies on the VLM's accuracy for gap identification; errors in concept detection (e.g., false positives in alignment checks) could lead to missed enhancements, though the paper notes robustness issues with some VLMs. 3. While the paper notes VLM API calls add 10
- The adaptation of RAG to text-to-image generation is inspiring. It focuses on polishing the generated image at inference time rather than retraining. - The formulation of CoT-based retrieval is promising and is effective for identifying the missing visual elements. - The evaluation is comprehensive, including qualitative comparisons, a user study, and ablation studies. It is good to see that failure cases are included in this paper, which helps the reader better understand the limitation o
- The applicable scenarios are limited. ImageRAG is best suited to prompts containing rare or even unusual concepts, such as "a boston bull", but its success depends heavily on the coverage of the retrieval dataset (e.g., the LAION-350K subset). It would be better to show that the method can handle the complex and lengthy prompts that are usually more customer-driven. - There is no report or measurement of the latency and computational cost of this multi-step pipeline, which is even more imp
1. The paper proposes a novel pipeline for image retrieval-augmented generation (RAG), which retrieves relevant images to enhance the generative performance of diffusion models. 2. The proposed framework automates the common engineering practice of improving generation quality through reference images, demonstrating strong practical value. 3. The paper introduces an efficient approach for leveraging an image knowledge database, which offers valuable insights for the development of multi
1. If the target concept does not exist in the retrieval dataset, could irrelevant images be selected and adversely affect generation quality? The paper should include corresponding quantitative analyses. 2. [Retrieval-Augmented Diffusion Models] has proposed a similar concept. The paper should provide a detailed comparison with this work using the same base model and dataset, and analyze the advantages of ImageRAG. 3. Will the diversi
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques
Methods: Balanced Selection
