FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models
Youngsun Lim, Cusuh Ham, Pin-Yu Chen, Deepti Ghadiyaram

TL;DR
FAGER is a framework for evaluating and improving text-to-image (T2I) models that checks whether generated images are factually correct with respect to the prompt, including facts the prompt only implies. It outperforms previous metrics across diverse datasets.
Contribution
FAGER introduces a structured factual evaluation method combining LLMs and visual verification, and enables training-free refinement of generated images for enhanced factual accuracy.
Findings
FAGER outperforms prior metrics in factuality preference tests across multiple datasets.
FAGER can refine T2I outputs without additional training, improving factual correctness.
The framework evaluates factual grounding in scientific, historical, product, and cultural contexts.
Abstract
Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for…
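The first stage described in the abstract (merging LLM-proposed facts with reference-verified facts into a rubric, then turning the rubric into question-answer pairs) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Fact` and `QAPair` types, the `build_rubric` and `rubric_to_qa` functions, the yes/no question template, and the rule of keeping only verified facts are hypothetical and are not taken from the paper. How the resulting pairs are consumed downstream is not specified in the text above, so the sketch stops at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical rubric entry: one visually verifiable claim."""
    statement: str   # e.g. "the tower has a lattice structure"
    source: str      # "llm" (proposed) or "reference" (extracted from a reference)
    verified: bool   # whether a reference check supported the claim


@dataclass(frozen=True)
class QAPair:
    """Hypothetical question-answer pair derived from one rubric entry."""
    question: str
    expected_answer: str


def build_rubric(llm_facts, reference_facts):
    """Merge both fact sources, keep verified facts, drop duplicates.

    Assumption: duplicates are detected by case-insensitive string match,
    and unverified facts are discarded.
    """
    seen = set()
    rubric = []
    for fact in list(llm_facts) + list(reference_facts):
        key = fact.statement.strip().lower()
        if fact.verified and key not in seen:
            seen.add(key)
            rubric.append(fact)
    return rubric


def rubric_to_qa(rubric):
    """Turn each rubric fact into a yes/no question (assumed template)."""
    pairs = []
    for fact in rubric:
        claim = fact.statement.strip().rstrip(".")
        pairs.append(
            QAPair(
                question=f"Does the image show that {claim}?",
                expected_answer="yes",
            )
        )
    return pairs


if __name__ == "__main__":
    # Made-up example facts, not data from the paper.
    llm_facts = [
        Fact("the tower has a lattice structure.", "llm", True),
        Fact("the tower is painted bright blue", "llm", False),
    ]
    reference_facts = [
        Fact("The tower has a lattice structure.", "reference", True),
        Fact("the tower has four curved legs", "reference", True),
    ]
    for pair in rubric_to_qa(build_rubric(llm_facts, reference_facts)):
        print(pair.question, "->", pair.expected_answer)
```

In this sketch the unverified "bright blue" claim is dropped and the duplicated lattice fact is kept once, so the rubric ends up with two entries and two questions.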
