KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C.K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang

TL;DR
KITTEN is a benchmark that evaluates how accurately current text-to-image models generate detailed visual entities grounded in real-world knowledge, revealing limitations in both fidelity and creativity.
Contribution
This paper introduces KITTEN, a novel benchmark for knowledge-intensive image generation, and provides a systematic evaluation of state-of-the-art models' capabilities and limitations.
Findings
Advanced models often fail to capture detailed visual features.
Retrieval-augmented models improve fidelity but over-rely on reference images.
Models struggle with generating novel configurations of entities.
Abstract
Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTensive image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of the latest text-to-image models and retrieval-augmented models, focusing on their ability to generate real-world visual entities, such as landmarks and animals. Analysis using carefully designed human evaluations, automatic metrics, and MLLM evaluations shows that even advanced text-to-image models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they…
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
- This paper tackles a unique and underexplored aspect of text-to-image generation: evaluating the fidelity of specific entities in generated images. The motivation and evaluation framework are well-grounded.
- Utilizing Wikipedia as a knowledge base is a clever choice, ensuring broad coverage and relevance of entities. The idea makes the entity coverage more scalable for future extension.
- The proposed evaluation approach, incorporating both human and automatic metrics, strength
- While the study suggests a direction for future work aimed at balancing entity fidelity with creative flexibility, the notion remains somewhat ambiguous and challenging to envision in practice. Achieving an optimal outcome where both precise entity representation and creative interpretation coexist is complex, and the paper could benefit from a clearer exploration or concrete examples of what such a balance might look like in generated images.
- The paper suggests that the optimal image genera
The focus of the paper is interesting: evaluating text-to-image models on generating real-world entities with some modifications. The benchmark would also be beneficial to the community. The evaluation is based on two aspects. One is faithfulness to the entity, and the other is prompt following, as the prompt may contain some changes to the entity. The metric is sound and reasonable. Beyond that, the authors perform a very comprehensive study on how the existing approache
As the main contribution is the benchmark, it would be better to provide more details on how the benchmark was collected. Throughout the paper, I only find the information that the benchmark is collected from 8 categories, and Table 7 in the supplementary material shows the number of prompts in each. It would help to share more details, e.g. how these categories were collected, why these 8 categories were chosen, whether they are diverse enough, how the prompts were collected, who collected the prompts, what the prompt length and v
1. The paper presents a novel problem in the field of image generation, focusing on the factuality of real-world knowledge in image generation, which is rarely explored in the existing literature.
2. The results are clearly presented, with sufficient tables and visual images to illustrate the findings.
3. The paper is well-written, with a clear and logical organization.
1. Insufficient Experimental Setup: The current experiments are limited to real-world entities in only 8 domains and 322 knowledge entities, which lacks diversity. This number is significantly less than the variety of entities present in the real world and does not comprehensively reflect the model's performance, potentially leading to data bias. The OntoNotes and WordNet datasets include a more diverse range of categories and entities. If you only consider your 8 domains and limited number of e
Taxonomy
Topics: Image Retrieval and Classification Techniques · Data Visualization and Analytics · Video Analysis and Summarization
