Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

Guoxin Yu; Chulun Zhou; Lemao Liu; Qi Wang; Mo Yu; Jialong Tang; Baosong Yang; Xiang Ao; Wai Lam; Yue Yu

arXiv:2604.11246 · cs.CL · April 15, 2026


TL;DR

This paper introduces WIMPE, an evaluation framework for long-form generative responses that breaks each reference answer into scoring points weighted by importance and correlates better with human judgments.

Contribution

The paper proposes a weighted multi-point evaluation framework with two complementary metrics, Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), for assessing model responses in generative tasks with long-form answers.

Findings

1. WIMPE achieves higher correlation with human annotations across 10 tasks.

2. The framework effectively captures the importance of different answer aspects.

3. WIMPE improves fine-grained evaluation of long-form generative responses.

Abstract

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods rely on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in the provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and…
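The abstract names the two metrics but does not give their formulas, so the following is only an illustrative sketch of how weighted scoring points could be aggregated. Everything in it is an assumption rather than the authors' definition: the `ScoringPoint` fields, the weight-normalized form of the alignment score, and the unweighted fraction used for the conflict penalty are hypothetical. The per-point `aligned` and `conflicted` judgments are taken as given (for example, produced by a human or an LLM judge).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoringPoint:
    """One point factorized from a reference answer (hypothetical structure)."""
    text: str
    weight: float      # assumed relative importance of this point
    aligned: bool      # judge decided the response covers this point
    conflicted: bool   # judge decided the response contradicts this point


def weighted_alignment(points: List[ScoringPoint]) -> float:
    """Assumed form: share of total weight carried by the covered points."""
    total = sum(p.weight for p in points)
    if total == 0:
        return 0.0
    return sum(p.weight for p in points if p.aligned) / total


def conflict_penalty(points: List[ScoringPoint]) -> float:
    """Assumed form: fraction of points that the response contradicts."""
    if not points:
        return 0.0
    return sum(1 for p in points if p.conflicted) / len(points)


# Made-up example: three points, the first being the most important.
example_points = [
    ScoringPoint("main cause identified", 3.0, aligned=True, conflicted=False),
    ScoringPoint("supporting evidence cited", 1.0, aligned=True, conflicted=False),
    ScoringPoint("stated date is correct", 1.0, aligned=False, conflicted=True),
]
alignment = weighted_alignment(example_points)
penalty = conflict_penalty(example_points)
```

In this made-up example the response covers 4 of the 5 weight units and contradicts 1 of the 3 points. Keeping the two numbers separate, instead of folding them into one score, mirrors the abstract's description of the metrics as complementary; how the paper actually combines or reports them is not stated in the text available here.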
