Evaluating Generative Language Models in Information Extraction as Subjective Question Correction
Yuchen Fan, Yantao Liu, Zijun Yao, Jifan Yu, Lei Hou, Juanzi Li

TL;DR
This paper introduces SQC-Score, an evaluation method that uses LLMs and natural language inference (NLI) to assess information extraction more accurately. It addresses both the imprecision of existing metrics and the incompleteness of evaluation benchmarks.
Contribution
The paper proposes SQC-Score, an evaluation approach that improves semantic matching between model outputs and gold labels and accounts for correct answers omitted from benchmark annotations, giving a fairer assessment of LLM performance in information extraction.
Findings
SQC-Score is preferred by human annotators over baseline metrics.
SQC-Score provides more accurate evaluation of LLMs in information extraction.
The method addresses both metric imprecision and benchmark incompleteness.
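The metric imprecision mentioned above can be illustrated with a toy example. The sketch below is not SQC-Score itself: it only shows how an exact-match scorer gives zero credit to a semantically equivalent relation triple, while a more lenient matcher does not. The triple format, the `RELATION_ALIASES` table, and all function names are assumptions made for illustration; in the paper the matching is done by a fine-tuned LLM, which the small alias table merely stands in for.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail); an assumed format

# Hypothetical alias table standing in for a learned semantic matcher.
RELATION_ALIASES = {"place of birth": "born in"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def canonical_relation(relation: str) -> str:
    """Map a relation string to its canonical form via the alias table."""
    rel = normalize(relation)
    return RELATION_ALIASES.get(rel, rel)


def exact_match(pred: Triple, gold: Triple) -> bool:
    """Conventional matching: every field must agree after normalization."""
    return tuple(map(normalize, pred)) == tuple(map(normalize, gold))


def alias_match(pred: Triple, gold: Triple) -> bool:
    """Lenient matching: relations are compared after alias resolution."""
    return (
        normalize(pred[0]) == normalize(gold[0])
        and canonical_relation(pred[1]) == canonical_relation(gold[1])
        and normalize(pred[2]) == normalize(gold[2])
    )


def prf(
    preds: List[Triple],
    golds: List[Triple],
    match: Callable[[Triple, Triple], bool],
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under a pluggable match function."""
    matched_preds = sum(any(match(p, g) for g in golds) for p in preds)
    matched_golds = sum(any(match(p, g) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


golds = [("Marie Curie", "born in", "Warsaw")]
preds = [("Marie Curie", "place of birth", "Warsaw")]

print(prf(preds, golds, exact_match))  # strict scoring
print(prf(preds, golds, alias_match))  # lenient scoring
```

Passing the matcher as a parameter keeps the scoring loop unchanged when the comparison rule is swapped, which is the only design point this sketch is meant to convey.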
Abstract
Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed, where these models underperform in seemingly elementary tasks like relation extraction and event extraction due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performance. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. This method innovatively utilizes LLMs, fine-tuned through subjective question correction data, to refine matching between…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
