A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations

Tian Lan; Yang-Hao Zhou; Zi-Ao Ma; Fanshu Sun; Rui-Qing Sun; Junyu Luo; Rong-Cheng Tu; Heyan Huang; Chen Xu; Zhijing Wu; Xian-Ling Mao

arXiv:2506.10019 · cs.CL · June 13, 2025

TL;DR

This paper provides a comprehensive review and taxonomy of automatic evaluation methods for generated content across text, visual, and speech modalities, highlighting current approaches and future research directions.

Contribution

It introduces a unified framework and taxonomy for evaluating generated content across multiple modalities, addressing the lack of systematic organization in existing methods.

Findings

01

Identifies five fundamental evaluation paradigms across modalities.

02

Extends the evaluation framework from text to image and audio generation.

03

Discusses promising future directions for cross-modal evaluation.

Abstract

Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally,…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis