Open Multimodal Retrieval-Augmented Factual Image Generation

Yang Tian; Fan Liu; Jingyuan Zhang; Wei Bi; Yupeng Hu; Liqiang Nie

arXiv:2510.22521 · cs.CV · October 28, 2025

TL;DR

This paper introduces ORIG, a retrieval-augmented framework that generates photorealistic, factually accurate images by iteratively retrieving web evidence and integrating it into the generation prompt, addressing the factual errors of existing models.

Contribution

The paper proposes ORIG, an open multimodal retrieval-augmented approach to factual image generation, and FIG-Eval, a new benchmark for evaluating the task systematically.

Findings

01. ORIG significantly improves factual consistency in generated images.

02. ORIG outperforms strong baselines in image quality and factual accuracy.

03. The approach demonstrates potential for open knowledge integration in image synthesis.

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts…
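
The retrieve, filter, and integrate loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Evidence` type, the `enrich_prompt` function, the score threshold, the round limit, and the toy retriever are all hypothetical stand-ins, and a real system would query the web and use a model to judge and merge evidence.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    text: str
    score: float  # hypothetical relevance score in [0, 1]


def enrich_prompt(
    prompt: str,
    retrieve: Callable[[str], List[Evidence]],
    max_rounds: int = 3,
    min_score: float = 0.5,
) -> str:
    """Iteratively retrieve, filter, and fold evidence into the prompt."""
    kept: List[str] = []
    enriched = prompt
    for _ in range(max_rounds):
        # Retrieve against the current (possibly already enriched) prompt.
        candidates = retrieve(enriched)
        # Filter: keep sufficiently relevant evidence not yet integrated.
        new = [
            e.text
            for e in candidates
            if e.score >= min_score and e.text not in kept
        ]
        if not new:
            break  # nothing new was found, so stop early
        kept.extend(new)
        # Integrate: rebuild the enriched prompt from all kept evidence.
        enriched = prompt + " | Facts: " + "; ".join(kept)
    return enriched


def toy_retrieve(query: str) -> List[Evidence]:
    # Assumption: a fixed list stands in for web search results.
    return [
        Evidence("placeholder fact A", 0.9),
        Evidence("placeholder noise", 0.1),
    ]


result = enrich_prompt("a photo of landmark X", toy_retrieve)
print(result)
```

In this sketch the loop stops as soon as a round yields no new evidence, which is one simple stopping rule among many; the paper's actual criteria are not given in the truncated abstract.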

Code & Models

Datasets

TyangJN/FIG
dataset · 35 dl
