Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models

Kapil Wanaskar; Gaytri Jena; Magdalini Eirinaki

arXiv:2505.04650 · cs.GR · May 9, 2025

TL;DR

This paper introduces an open-source benchmarking framework for text-to-image models, emphasizing metadata-augmented prompts, and demonstrates how structured metadata improves image quality and model robustness.

Contribution

It provides a unified evaluation framework using diverse metrics and shows the benefits of metadata enrichment for text-to-image generation.

Findings

01

Metadata augmentation improves visual realism.

02

Structured prompts enhance semantic fidelity.

03

Framework enables task-specific model recommendations.
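To make findings 01 and 02 concrete, here is a minimal sketch of what a metadata-augmented prompt could look like. The function name `augment_prompt`, the metadata keys (`fabric`, `pattern`, `color`), and the "key: value" layout are all illustrative assumptions; the page does not describe the paper's actual prompt template.

```python
def augment_prompt(caption: str, metadata: dict[str, str]) -> str:
    """Append non-empty metadata fields to a caption as 'key: value' pairs.

    Hypothetical helper: the field names and the output layout are
    assumptions for illustration, not the paper's prompt format.
    """
    parts = [f"{key}: {value}" for key, value in metadata.items() if value]
    if not parts:
        # Nothing to add, so the plain caption is used unchanged.
        return caption
    # Normalize the caption's ending before appending the structured fields.
    return f"{caption.rstrip('. ')}. " + ", ".join(parts)


base = "A woman wearing a long-sleeve shirt"
meta = {"fabric": "cotton", "pattern": "striped", "color": ""}  # empty values are skipped
print(augment_prompt(base, meta))
```

Generating images from both the plain caption and the augmented one, then scoring each set with the same metrics, is one way such a comparison could be set up.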

Abstract

This work presents an open-source, unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata-augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language-Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Fréchet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
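Of the metrics listed in the abstract, FID has a closed form that is easy to sketch. The snippet below computes the Fréchet distance between two sets of feature vectors from their means and covariances. It is a simplified illustration, not the framework's implementation: real FID uses Inception-v3 activations, whereas the random arrays here are stand-ins, and the function name `frechet_distance` is an assumption.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_r - mu_g||^2 + Tr(C_r) + Tr(C_g) - 2 * Tr(sqrt(C_r C_g))
    Each input has shape (n_samples, n_features).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr(sqrt(C_r C_g)) equals the sum of square roots of the eigenvalues
    # of C_r C_g, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Stand-in features; a real evaluation would use Inception-v3 activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + np.array([1.0, 0.0, 0.0, 0.0])  # same covariance, mean moved by 1

print(frechet_distance(real, real))     # close to 0 for identical sets
print(frechet_distance(real, shifted))  # close to 1, the squared mean shift
```

Lower values indicate that the generated feature distribution is closer to the reference distribution.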

Code & Models

Repositories

kapilw25/Evaluation_generated_images
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics

Methods: Sparse Evolutionary Training · Focus · Contrastive Language-Image Pre-training