A Novel Evaluation Framework for Image2Text Generation

Jia-Hong Huang; Hongyi Zhu; Yixian Shen; Stevan Rudinac; Alessio M. Pacces; Evangelos Kanoulas

arXiv:2408.01723 · cs.CV · August 6, 2024 · 2 cites

TL;DR

This paper introduces a novel image captioning evaluation framework using large language models and image generation, which correlates well with human judgment without needing reference captions.

Contribution

The proposed framework uses an LLM to generate an image from each candidate caption and scores the caption by how similar the generated image is to the original, eliminating the need for human-annotated reference captions.

Findings

01. High correlation with human judgment confirmed

02. Effective without reference captions

03. Outperforms traditional metrics

Abstract

Evaluating the quality of automatically generated image descriptions is challenging, requiring metrics that capture various aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often show weak correlations with human judgment. We address this challenge by introducing a novel evaluation framework rooted in a modern large language model (LLM), such as GPT-4 or Gemini, capable of image generation. In our proposed framework, we begin by feeding an input image into a designated image captioning model, chosen for evaluation, to generate a textual description. Using this description, an LLM then creates a new image. By extracting features from both the original and LLM-created images,…
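
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `captioner`, `image_generator`, and `feature_extractor` are hypothetical callables standing in for the captioning model under evaluation, the image-generating LLM, and an image encoder. The abstract is cut off before it states how the two feature sets are compared, so cosine similarity is used here purely as an assumption.

```python
import math
from typing import Any, Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for degenerate (all-zero) features
    return dot / (norm_a * norm_b)


def caption_score(
    image: Any,
    captioner: Callable[[Any], str],             # hypothetical: model under evaluation
    image_generator: Callable[[str], Any],       # hypothetical: LLM that renders an image from text
    feature_extractor: Callable[[Any], Vector],  # hypothetical: image encoder
) -> float:
    """Score a captioning model on one image without any reference caption."""
    caption = captioner(image)               # step 1: image -> text
    regenerated = image_generator(caption)   # step 2: text -> new image
    original_features = feature_extractor(image)
    regenerated_features = feature_extractor(regenerated)
    # step 3: compare the two images in feature space (assumed metric)
    return cosine_similarity(original_features, regenerated_features)
```

Under this sketch, a caption that preserves the content of the input image should lead to a regenerated image whose features lie close to the original's, giving a score near 1; a caption that misses or misstates content should score lower.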

Taxonomy

Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections