Open Multimodal Retrieval-Augmented Factual Image Generation
Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie

TL;DR
This paper introduces ORIG, a retrieval-augmented framework that generates photorealistic, factually accurate images by iteratively retrieving web evidence and integrating it into the prompt, addressing the tendency of existing models to contradict verifiable knowledge.
Contribution
The paper proposes ORIG, a novel open multimodal retrieval-augmented approach for factual image generation, along with a new benchmark FIG-Eval for systematic evaluation.
Findings
ORIG significantly improves factual consistency in generated images.
ORIG outperforms strong baselines in image quality and factual accuracy.
The approach demonstrates potential for open knowledge integration in image synthesis.
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts…
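The retrieve, filter, and integrate loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Evidence` record, the score threshold, the stopping rule, the prompt template, and the `toy_retrieve` stub are all assumptions made for illustration, and a real system would call a web search backend and an LMM-based filter instead.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Evidence:
    """Hypothetical evidence record; the fields are illustrative assumptions."""
    source: str    # identifier of where the evidence came from
    modality: str  # "text" or "image"
    content: str   # text snippet, or a caption standing in for an image
    score: float   # assumed relevance score in [0, 1]


def filter_evidence(candidates: List[Evidence],
                    seen: Set[Tuple[str, str]],
                    min_score: float) -> List[Evidence]:
    """Keep evidence that clears the threshold and was not kept before."""
    kept = []
    for ev in candidates:
        key = (ev.source, ev.content)
        if ev.score >= min_score and key not in seen:
            seen.add(key)
            kept.append(ev)
    return kept


def enrich_prompt(prompt: str, evidence: List[Evidence]) -> str:
    """Append accumulated facts to the base prompt (assumed template)."""
    if not evidence:
        return prompt
    facts = "; ".join(ev.content for ev in evidence)
    return f"{prompt} [facts: {facts}]"


def iterative_enrichment(prompt: str,
                         retrieve: Callable[[str], List[Evidence]],
                         max_rounds: int = 3,
                         min_score: float = 0.5) -> str:
    """Retrieve, filter, and integrate until no new evidence survives."""
    seen: Set[Tuple[str, str]] = set()
    accumulated: List[Evidence] = []
    enriched = prompt
    for _ in range(max_rounds):
        new = filter_evidence(retrieve(enriched), seen, min_score)
        if not new:
            break  # nothing new passed the filter, so stop early
        accumulated.extend(new)
        enriched = enrich_prompt(prompt, accumulated)
    return enriched


def toy_retrieve(query: str) -> List[Evidence]:
    """Stub retriever returning a fixed corpus; stands in for web search."""
    return [
        Evidence("doc-a", "text", "the tower is painted red", 0.9),
        Evidence("doc-b", "text", "unrelated trivia", 0.2),
        Evidence("img-c", "image", "reference photo: four spires", 0.7),
    ]


if __name__ == "__main__":
    print(iterative_enrichment("a photo of the tower", toy_retrieve))
```

In this sketch the enriched prompt is what would be handed to an image generator. Rebuilding it from the base prompt each round, rather than appending to the previous enriched string, keeps repeated rounds from nesting the fact list.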
