Visual question answering based evaluation metrics for text-to-image generation

Mizuki Miyamoto; Ryugo Morita; Jinjia Zhou

arXiv:2411.10183·cs.CV·November 18, 2024

TL;DR

This paper introduces new evaluation metrics for text-to-image generation that utilize question generation and visual question answering to assess detailed text-image alignment and image quality.

Contribution

The paper proposes a novel evaluation framework combining question-based assessment and image quality metrics for more precise evaluation of text-to-image models.

Findings

01 The proposed metrics outperform existing methods in assessing text-image alignment.

02 The approach allows for adjustable weighting between alignment and image quality.

03 Experimental results validate the effectiveness of the new evaluation approach.
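
The adjustable weighting in the second finding can be illustrated with a small sketch. The summary does not give the paper's actual formula, so the linear blend below, the function name `combined_score`, and the parameter `weight` are all assumptions made for illustration. Both inputs are assumed to be normalized to the range [0, 1].

```python
def combined_score(alignment: float, quality: float, weight: float = 0.5) -> float:
    """Blend an alignment score and an image-quality score.

    Hypothetical linear combination, not the paper's confirmed formula.
    weight = 1.0 uses alignment only; weight = 0.0 uses image quality only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Convex combination: the two coefficients always sum to 1.
    return weight * alignment + (1.0 - weight) * quality
```

Under this assumed form, raising `weight` makes the final score more sensitive to whether the text is reflected in the image, and lowering it favors perceptual quality.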

Abstract

Text-to-image generation and text-guided image manipulation have received considerable attention in the field of image generation. However, mainstream evaluation methods for these tasks mainly assess the overall alignment between the input text and the generated image, and they have difficulty determining whether all the information in the input text is accurately reflected in the image. This paper proposes new evaluation metrics that assess the alignment between the input text and the generated image for every individual object. First, ChatGPT is used to generate questions about the generated image from the input text. Then, Visual Question Answering (VQA) is used to measure the relevance of the generated image to the input text, which allows a more detailed evaluation of alignment than existing methods. In addition, we…
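
The question-then-answer procedure described in the abstract could be scored roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the generated questions are yes/no questions whose expected answer is "yes", and it takes the VQA model as a caller-supplied function. The names `vqa_alignment_score` and `answer_fn` are hypothetical.

```python
from typing import Any, Callable, Sequence


def vqa_alignment_score(
    image: Any,
    questions: Sequence[str],
    answer_fn: Callable[[Any, str], str],
    expected: str = "yes",
) -> float:
    """Return the fraction of questions the VQA model answers as expected.

    answer_fn stands in for a real VQA model (an assumption): it receives
    the image and one question and returns a short text answer.
    """
    if not questions:
        raise ValueError("at least one question is required")
    # Count questions whose normalized answer matches the expected answer.
    hits = sum(
        1 for q in questions
        if answer_fn(image, q).strip().lower() == expected
    )
    return hits / len(questions)
```

Because each question targets one object or attribute from the prompt, a score computed this way drops when any single element is missing, which a single global text-image similarity may not reveal.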

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Motion and Animation

Methods: Softmax · Attention Is All You Need · Focus