Check-Eval: A Checklist-based Approach for Evaluating Text Quality
Jayr Pereira, Andre Assumpcao, Roberto Lotufo

TL;DR
Check-Eval introduces a checklist-based framework leveraging large language models to evaluate generated text quality, achieving higher correlation with human judgments than existing metrics across benchmark datasets.
Contribution
The paper presents a structured evaluation method that combines checklist generation with checklist evaluation, improving alignment with human judgments of generated text.
Findings
Achieves higher correlation with human judgments than existing metrics such as G-Eval and GPTScore.
Can be used as either a reference-free or a reference-dependent evaluation method.
Validated on Portuguese Legal Semantic Textual Similarity and SummEval datasets.
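The findings above are stated in terms of correlation with human judgments. As a rough illustration of what such a comparison involves, the sketch below computes a Spearman rank correlation between metric scores and human ratings in plain Python. The choice of Spearman is an assumption made for illustration; this summary does not say which correlation coefficient the authors report, and the function names are hypothetical.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A metric whose scores order the texts exactly as the human ratings do yields a correlation of 1, and a metric that orders them in reverse yields -1.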
Abstract
Evaluating the quality of text generated by large language models (LLMs) remains a significant challenge. Traditional metrics often fail to align well with human judgments, particularly in tasks requiring creativity and nuance. In this paper, we propose Check-Eval, a novel evaluation framework leveraging LLMs to assess the quality of generated text through a checklist-based approach. Check-Eval can be employed as both a reference-free and reference-dependent evaluation method, providing a structured and interpretable assessment of text quality. The framework consists of two main stages: checklist generation and checklist evaluation. We validate Check-Eval on two benchmark datasets: Portuguese Legal Semantic Textual Similarity and SummEval. Our results demonstrate that Check-Eval achieves higher correlations with human judgments compared to…
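The two stages named in the abstract, checklist generation and checklist evaluation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompts, the yes/no answer format, the count-based score, and every name here (`LLM`, `generate_checklist`, `evaluate_checklist`, `check_eval`) are hypothetical. The LLM is passed in as a plain callable so that no particular provider API is assumed.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a function that takes a prompt and returns a completion.
LLM = Callable[[str], str]


@dataclass
class CheckEvalResult:
    checklist: list[str]
    satisfied: list[bool]

    @property
    def score(self) -> int:
        # Assumption: the score is the number of checklist items satisfied.
        return sum(self.satisfied)


def generate_checklist(llm: LLM, basis_text: str, max_items: int = 5) -> list[str]:
    """Stage 1: ask the LLM for key points, one per line (prompt is illustrative)."""
    prompt = (
        f"List up to {max_items} key points that a good text must cover, "
        f"one per line.\n\nText: {basis_text}"
    )
    items = [line.strip(" -*\t") for line in llm(prompt).splitlines()]
    return [item for item in items if item][:max_items]


def evaluate_checklist(llm: LLM, checklist: list[str], candidate: str) -> CheckEvalResult:
    """Stage 2: ask the LLM whether the candidate satisfies each item."""
    satisfied = []
    for item in checklist:
        prompt = (
            "Does the text below satisfy this checklist item? Answer yes or no.\n\n"
            f"Item: {item}\nText: {candidate}"
        )
        satisfied.append(llm(prompt).strip().lower().startswith("yes"))
    return CheckEvalResult(checklist=checklist, satisfied=satisfied)


def check_eval(llm: LLM, basis_text: str, candidate: str) -> CheckEvalResult:
    """Run both stages; basis_text is whatever text the checklist is derived from."""
    checklist = generate_checklist(llm, basis_text)
    return evaluate_checklist(llm, checklist, candidate)
```

Because the result keeps the per-item verdicts alongside the total, a reader can see which checklist items a candidate text missed, which is one way to read the abstract's claim of a structured and interpretable assessment.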
Taxonomy
Topics: Pharmacy and Medical Practices
Methods: ALIGN
