TL;DR
This paper introduces new transformer-based metrics for evaluating semantic answer similarity in question-answering systems, demonstrating improved correlation with human judgment and providing a novel dataset of co-referent name pairs.
Contribution
It proposes cross-encoder augmented bi-encoder and BERTScore models for semantic similarity, and releases the first dataset of co-referent name pairs for training and evaluation.
Findings
The proposed models achieve higher correlation with human judgments, notably when answers share no lexical overlap.
New dataset of co-referent name pairs is introduced.
Semantic metrics outperform string overlap in answer evaluation.
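To illustrate why string overlap can fall short on co-referent names, here is a minimal sketch of a token-level F1 score, a common string-overlap baseline in QA evaluation. It is not the paper's code: the function name `token_f1`, the whitespace tokenization, and the example name pairs are assumptions chosen for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answers (illustrative string-overlap baseline)."""
    # Assumption: lowercase and split on whitespace; real evaluation scripts
    # often also strip punctuation and articles.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; an empty answer never matches a non-empty one.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pairs, not taken from the paper's dataset.
print(token_f1("JFK", "John F. Kennedy"))      # 0.0: same person, no shared tokens
print(token_f1("New York City", "New York"))   # partial credit from shared tokens
```

The first pair refers to the same entity yet scores zero because no token is shared, which is the kind of case a semantic metric is meant to handle.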
Abstract
There are several issues with the existing general machine translation or natural language generation evaluation metrics, and question-answering (QA) systems are no different in that context. To build robust QA systems, we need equally robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics rather than pure string overlap is important for comparing models fairly and for setting more realistic acceptance criteria in real-life applications. We build upon what is, to our knowledge, the first paper to use transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained…
