A Comparative Study of Quality Evaluation Methods for Text Summarization
Huyen Nguyen, Haihua Chen, Lavanya Pobbathi, Junhua Ding

TL;DR
This paper introduces a new LLM-based method for evaluating text summarization and shows that it aligns more closely with human judgment than traditional automatic metrics on datasets of patent documents.
Contribution
The paper presents a novel LLM-based evaluation approach and provides a comprehensive comparison with existing metrics and human assessments.
Findings
LLM evaluation aligns closely with human judgment
Widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not align with human evaluation and lack consistency
Proposed framework improves automatic evaluation of summarization
Abstract
Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics that rely heavily on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) for evaluating text summarization. We also conduct a comparative study of eight automatic metrics, human evaluation, and our proposed LLM-based method. Seven different types of state-of-the-art (SOTA) summarization models were evaluated. We perform extensive experiments and analysis on datasets of patent documents. Our results show that LLM evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency. Based on the empirical comparison, we propose…
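To illustrate why reference-dependent metrics can be a poor fit, the following is a minimal sketch of a ROUGE-2 style F1 score (bigram overlap between a candidate summary and a reference). It is not the paper's evaluation code: the whitespace tokenizer, the function names `bigrams` and `rouge2_f1`, and the example sentences are all assumptions made for illustration, and standard ROUGE implementations add steps such as stemming.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs using a naive lowercase whitespace tokenizer (an assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from the paper's datasets.
reference = "the device comprises a sensor coupled to a processor"
copy_like = "the device comprises a sensor coupled to a controller"
paraphrase = "a processor is connected to a sensor inside the apparatus"

print(rouge2_f1(copy_like, reference))
print(rouge2_f1(paraphrase, reference))
```

In this toy case the near-copy that swaps "processor" for "controller" scores far higher than the paraphrase, even though the swap changes what the summary says. Surface overlap of this kind is one plausible reason such metrics can diverge from human judgment.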
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Data Quality and Management
Methods: Softmax · Attention Is All You Need
