Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator
Frederic Kirstein, Terry Ruas, Bela Gipp

TL;DR
This paper introduces MESA, a multi-LLM framework that improves the automatic evaluation of meeting summaries by better detecting errors and aligning with human judgments, reducing reliance on costly human assessments.
Contribution
MESA is a novel multi-LLM framework that employs error-specific assessment, multi-agent discussion, and self-training to enhance summary quality evaluation accuracy.
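As a rough illustration of per-error-type assessment with several judges, the sketch below scores each error type separately and aggregates the judges' votes. It is not the authors' implementation: the error type names, the 0 to 5 severity scale, the stub judges, and the function names are all hypothetical, and the multi-agent discussion is replaced here by a plain majority vote with a tie-break toward the higher severity. In practice each judge would wrap an LLM call.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical error types; this summary does not list the paper's taxonomy.
ERROR_TYPES = ["omission", "hallucination", "incoherence"]

# A judge maps (transcript, summary, error_type) to a severity score,
# assumed here to range from 0 (error absent) to 5 (severe).
Judge = Callable[[str, str, str], int]


def assess_error(judges: List[Judge], transcript: str, summary: str,
                 error_type: str) -> int:
    """Aggregate the judges' scores for one error type by majority vote."""
    votes = [judge(transcript, summary, error_type) for judge in judges]
    counts = Counter(votes)
    top = max(counts.values())
    # On a tie, prefer the higher severity so disagreement does not hide an error.
    return max(score for score, n in counts.items() if n == top)


def evaluate_summary(judges: List[Judge], transcript: str,
                     summary: str) -> Dict[str, int]:
    """Score every error type independently and return one score per type."""
    return {e: assess_error(judges, transcript, summary, e) for e in ERROR_TYPES}


# Stub judges standing in for LLM calls (assumptions for illustration only).
def lenient_judge(transcript: str, summary: str, error_type: str) -> int:
    return 0


def keyword_judge(transcript: str, summary: str, error_type: str) -> int:
    # Flags an omission when the transcript mentions "budget" but the summary does not.
    if error_type == "omission" and "budget" in transcript and "budget" not in summary:
        return 3
    return 0


if __name__ == "__main__":
    scores = evaluate_summary(
        [lenient_judge, keyword_judge, keyword_judge],
        "We agreed to cut the budget.",
        "The team met.",
    )
    print(scores)
```

Scoring each error type on its own keeps one salient problem from dominating the overall judgment, which is the motivation the contribution statement gives for error-specific assessment.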
Findings
MESA achieves higher correlation with human judgment than previous methods.
The framework effectively detects nuanced errors in meeting summaries.
MESA adapts well to custom error guidelines across different tasks.
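Correlation with human judgment, as in the first finding, is commonly measured with a rank correlation such as Spearman's rho between an evaluator's scores and human ratings of the same summaries. The sketch below is a minimal pure-Python version for illustration; this summary does not say which correlation coefficient the paper reports, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near 1 means the evaluator orders summaries the way human annotators do; a value near 0 means its ranking carries little information about human preference.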
Abstract
The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which offer better context understanding and can adapt to error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation as the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition…
Taxonomy
Topics: Healthcare Systems and Technology · Library Science and Information Systems
