TL;DR
This paper meta-evaluates 14 automatic summarization metrics and LLM-based evaluators against human judgments, and introduces LLM-ReSum, a self-reflective framework that improves summary quality without finetuning across multiple domains.
Contribution
It presents LLM-ReSum, a novel self-reflective summarization framework leveraging LLM-based evaluation and generation in a closed loop, with no model finetuning.
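As a rough illustration of what a closed evaluation-and-generation loop with no finetuning could look like, here is a minimal sketch. It is not the authors' implementation: the function name `refine_summary`, the `generate` and `evaluate` callables, the score scale, the threshold, and the round limit are all hypothetical, and a real system would wrap LLM calls behind the two callables.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generate(document, feedback) -> summary text; feedback is None on the first pass
#   evaluate(document, summary)  -> (score in [0, 1], textual feedback)
Generator = Callable[[str, Optional[str]], str]
Evaluator = Callable[[str, str], Tuple[float, str]]


def refine_summary(
    document: str,
    generate: Generator,
    evaluate: Evaluator,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Regenerate a summary from evaluator feedback until it scores well enough.

    No model weights are updated; only the prompt-side feedback changes.
    """
    summary = generate(document, None)
    score, feedback = evaluate(document, summary)

    for _ in range(max_rounds):
        if score >= threshold:
            break
        candidate = generate(document, feedback)
        cand_score, cand_feedback = evaluate(document, candidate)
        # Keep the best summary seen so far, so a bad revision cannot make things worse.
        if cand_score > score:
            summary, score, feedback = candidate, cand_score, cand_feedback

    return summary, score
```

Passing the generator and evaluator as plain callables keeps the loop independent of any particular model or API, which makes it easy to test with stubs before plugging in real LLM calls.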
Findings
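Traditional lexical overlap metrics such as ROUGE and BLEU show weak or negative correlation with human judgments.
Task-specific neural metrics and LLM-based evaluators align substantially better with human judgments than lexical overlap metrics, especially for linguistic quality.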
LLM-ReSum improves factual accuracy by up to 33% and coverage by 39%.
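To make "correlation with human judgments" concrete, the sketch below computes a rank correlation between a metric's scores and human ratings for the same summaries. The summary does not say which coefficient the paper uses; Spearman's rho is an assumption here, and the function names are illustrative. It uses only the standard library.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near 1 means the metric ranks summaries the way human annotators do, a value near 0 means little agreement, and a negative value means the metric tends to prefer the summaries humans rate lower.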
Abstract
Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a…
