'Tis but Thy Name: Semantic Question Answering Evaluation with 11M Names for 1M Entities
Albert Huang

TL;DR
This paper introduces WES, a large-scale semantic entity similarity dataset built from Wikipedia link texts, so that QA evaluation metrics can judge semantic correctness rather than lexical overlap.
Contribution
The paper presents WES, an 11 million example dataset for semantic entity similarity, tailored to QA evaluation, and shows that a basic cross encoder metric predicts human judgments of correctness better than four classic metrics.
Findings
Human annotators consistently agree with WES labels.
A basic cross encoder metric outperforms four classic metrics at predicting human judgments of correctness.
WES enables more accurate semantic evaluation in QA systems.
Abstract
Classic lexical-matching-based QA metrics are slowly being phased out because they punish succinct or informative outputs just because those answers were not provided as ground truth. Recently proposed neural metrics can evaluate semantic similarity but were trained on small textual similarity datasets grafted from foreign domains. We introduce the Wiki Entity Similarity (WES) dataset, an 11M example, domain targeted, semantic entity similarity dataset that is generated from link texts in Wikipedia. WES is tailored to QA evaluation: the examples are entities and phrases and grouped into semantic clusters to simulate multiple ground-truth labels. Human annotators consistently agree with WES labels, and a basic cross encoder metric is better than four classic metrics at predicting human judgments of correctness.
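The failure mode described in the abstract can be illustrated with a minimal sketch. It is not the paper's pipeline: the (link text, target article) pairs below are invented toy data, and the helper names `build_clusters`, `exact_match`, `token_f1`, and `best_over_cluster` are hypothetical. The sketch groups link texts by the article they point to, then shows how a lexical metric scores a correct alias against a single ground truth versus against the whole cluster.

```python
from collections import Counter, defaultdict

# Toy (link text, target article) pairs, invented for illustration only.
LINKS = [
    ("NYC", "New York City"),
    ("the Big Apple", "New York City"),
    ("JFK", "John F. Kennedy"),
    ("President Kennedy", "John F. Kennedy"),
]

def build_clusters(links):
    """Group link texts by target article; each group acts as a set of aliases."""
    clusters = defaultdict(set)
    for text, target in links:
        clusters[target].add(target)  # the article title names the entity too
        clusters[target].add(text)
    return clusters

def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Token-level F1 between a prediction and one reference answer."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def best_over_cluster(metric, pred, cluster):
    """Score against every alias in the cluster and keep the best score."""
    return max(metric(pred, name) for name in cluster)

clusters = build_clusters(LINKS)
gold = "New York City"
pred = "NYC"  # a correct answer that shares no tokens with the gold string

single_em = exact_match(pred, gold)
single_f1 = token_f1(pred, gold)
cluster_em = best_over_cluster(exact_match, pred, clusters[gold])
```

Against the single reference, both lexical metrics give the correct alias a score of zero; against the cluster, exact match succeeds. A lookup like this only covers aliases already seen in the cluster, which is one reason a learned similarity model, such as the cross encoder the abstract mentions, may generalize better.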
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
