Calibrating Model-Based Evaluation Metrics for Summarization
Hongye Liu, Dhanajit Brahma, Ricardo Henao

TL;DR
This paper introduces a calibration framework for model-based summarization metrics that generates reliable scores without references or human annotations, improving evaluation accuracy across multiple datasets.
Contribution
The paper proposes group isotonic regression binning (GIRB), a calibration method that makes model-based summarization metrics more reliable without requiring references or expensive models.
Findings
GIRB improves calibration of evaluation scores across seven datasets.
The framework generates individual and average scores without references.
Experiments show consistent outperformance over existing baselines.
Abstract
Recent advances in summary evaluation are based on model-based metrics to assess quality dimensions, such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to…
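The abstract describes calibration as adjusting raw predictions to better align with ground-truth metric values. As a rough illustration of that idea only, the sketch below fits plain isotonic regression with the pool-adjacent-violators algorithm in pure Python. It is not GIRB: the grouping and binning steps are not specified in this excerpt, and the function names `isotonic_fit` and `isotonic_predict` and the toy scores are assumptions made for the example.

```python
def isotonic_fit(scores, targets):
    """Fit a non-decreasing step function with pool adjacent violators.

    Plain isotonic regression, shown for illustration; NOT the paper's GIRB.
    Returns a list of (upper_raw_score, calibrated_value) steps.
    """
    pairs = sorted(zip(scores, targets))
    # Each block holds [sum of targets, count, largest raw score in block].
    blocks = []
    for x, y in pairs:
        blocks.append([y, 1, x])
        # Merge backwards while the previous block mean exceeds the new one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def isotonic_predict(model, score):
    """Map a raw score to the calibrated value of the step that contains it."""
    for upper, value in model:
        if score <= upper:
            return value
    # Scores above the largest fitted raw score take the last step's value.
    return model[-1][1]


# Hypothetical toy data: raw metric predictions and ground-truth values.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
true_scores = [0.2, 0.1, 0.4, 0.3, 0.9]

model = isotonic_fit(raw_scores, true_scores)
calibrated = [isotonic_predict(model, s) for s in raw_scores]
print(calibrated)
```

The fitted mapping is monotone by construction, so a higher raw score never receives a lower calibrated score. Out-of-order pairs in the toy data are pooled to their mean.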
