The Validity of Coreference-based Evaluations of Natural Language Understanding

Ian Porada

arXiv:2602.16200·cs.CL·February 19, 2026

Open Access

TL;DR

This thesis critically examines coreference-based evaluations of natural language understanding, shows that problems of measurement validity limit the conclusions they support, and proposes a new evaluation of systems' ability to infer the relative plausibility of events.

Contribution

The thesis identifies two measurement-validity issues in existing coreference evaluations, contestedness and weak convergent validity, and introduces a novel evaluation that tests whether systems can infer the relative plausibility of events.

Findings

01

Models perform well on standard benchmarks, but that performance does not reliably generalize.

02

Models are sensitive to evaluation conditions and often fail to generalize.

03

Current evaluation methods have validity issues, limiting conclusions.

Abstract

In this thesis, I refine our understanding of what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results converge or conflict. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity, including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks, improving over…
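
The convergent-validity concern mentioned in the abstract (benchmarks that rank the same models differently) can be made concrete with a rank correlation. The following is a minimal sketch, not the method used in the thesis: it computes Kendall's tau-a between two benchmarks' model rankings. The function name `kendall_tau`, the model names, and all scores are hypothetical and invented purely for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts (no tie correction)."""
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two models scored on both benchmarks")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # The sign of the product shows whether both benchmarks
        # order this pair of models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for illustration (not results from the thesis).
benchmark_a = {"model_x": 0.80, "model_y": 0.75, "model_z": 0.60}
benchmark_b = {"model_x": 0.55, "model_y": 0.70, "model_z": 0.65}

tau = kendall_tau(benchmark_a, benchmark_b)
print(f"Kendall's tau between benchmarks: {tau:.3f}")
```

A value near 1 would mean the two benchmarks rank models consistently; a value near 0 or below, as with these made-up numbers, would mean the rankings disagree, which is the kind of conflict between evaluation results that the abstract describes.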

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems