The Validity of Coreference-based Evaluations of Natural Language Understanding
Ian Porada

TL;DR
This paper critically examines coreference-based evaluations of NLP systems, revealing their limitations in measurement validity and proposing a new evaluation method to better assess models' understanding of event plausibility.
Contribution
It identifies issues with existing coreference evaluations and introduces a novel evaluation focusing on event plausibility inference to improve assessment accuracy.
Findings
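Contemporary language models perform strongly on standard coreference benchmarks.
The same models are sensitive to evaluation conditions and often fail to generalize.
Standard coreference evaluations have measurement validity issues, including contestedness and weak convergent validity, which limit the conclusions they support.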
Abstract
In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity - including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks - improving over…
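The convergent validity concern in the abstract (benchmarks that rank the same models differently) can be illustrated with a rank correlation. The sketch below is only an illustration, not the thesis's method: the function name `kendall_tau`, the model names, and the scores are all hypothetical placeholders, and Kendall's tau is one assumed choice of agreement measure among several.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts (no tie correction)."""
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two models scored on both benchmarks")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both benchmarks order the two models the same way.
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for illustration only (not results from the thesis).
benchmark_a = {"model_x": 0.80, "model_y": 0.75, "model_z": 0.60}
benchmark_b = {"model_x": 0.55, "model_y": 0.70, "model_z": 0.65}

tau = kendall_tau(benchmark_a, benchmark_b)
print(f"Kendall tau between benchmark rankings: {tau:.2f}")
```

A tau near 1 means the two benchmarks order the models the same way; a value near 0 or below means the rankings disagree, which is the kind of conflict the abstract describes as a convergent validity problem.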
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
