FineSurE: Fine-grained Summarization Evaluation using LLMs
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour

TL;DR
FineSurE introduces a multi-dimensional, fine-grained evaluation method for text summarization using LLMs, addressing limitations of existing metrics by assessing faithfulness, completeness, and conciseness at the sentence level.
Contribution
FineSurE is an LLM-based evaluator that assesses summaries along several dimensions at the sentence level, rather than assigning a single summary-level score as existing LLM-based metrics do.
Findings
Outperforms state-of-the-art methods on completeness and conciseness.
Enables sentence-level hallucination detection.
Demonstrates versatility across various LLM backbones.
Abstract
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In…
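The difference between one summary-level score and sentence-level counts can be made concrete with a small sketch. It is illustrative only and does not reproduce the authors' implementation: the `SentenceJudgment` record, the `fine_grained_scores` function, and the three ratio definitions are assumptions of this example. It assumes faithfulness is the share of summary sentences with no flagged error, completeness is the share of key facts covered by the summary, and conciseness is the share of sentences that cover at least one key fact. In practice the per-sentence judgments would come from an LLM; here they are hard-coded.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceJudgment:
    """Hypothetical per-sentence verdict, e.g. parsed from an LLM response."""
    has_error: bool                                      # sentence flagged as hallucinated/unfaithful
    matched_keyfacts: set = field(default_factory=set)   # indices of key facts the sentence covers


def fine_grained_scores(judgments, num_keyfacts):
    """Aggregate sentence-level verdicts into three percentage-style scores.

    The ratio definitions below are assumptions of this sketch.
    """
    n = len(judgments)
    if n == 0:
        return {"faithfulness": 0.0, "completeness": 0.0, "conciseness": 0.0}

    # Share of sentences with no flagged error.
    faithfulness = sum(not j.has_error for j in judgments) / n

    # Share of key facts covered by at least one sentence.
    covered = set()
    for j in judgments:
        covered |= j.matched_keyfacts
    completeness = len(covered) / num_keyfacts if num_keyfacts else 0.0

    # Share of sentences that cover at least one key fact.
    conciseness = sum(bool(j.matched_keyfacts) for j in judgments) / n

    return {
        "faithfulness": faithfulness,
        "completeness": completeness,
        "conciseness": conciseness,
    }


# Toy input: three summary sentences judged against four key facts.
example = [
    SentenceJudgment(has_error=False, matched_keyfacts={0, 1}),
    SentenceJudgment(has_error=True),
    SentenceJudgment(has_error=False, matched_keyfacts={1}),
]
scores = fine_grained_scores(example, num_keyfacts=4)
```

Because the scores are built from per-sentence verdicts, the sentence that triggered a hallucination flag can be located directly, which a single Likert-scale rating does not allow.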
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
