Semantic Answer Similarity for Evaluating Question Answering Models
Julian Risch, Timo Möller, Julian Gutsch, Malte Pietsch

TL;DR
This paper introduces SAS, a cross-encoder-based semantic answer similarity metric that aligns better with human judgment than traditional lexical metrics when evaluating question answering models.
Contribution
The paper presents SAS, a novel transformer-based semantic similarity metric for answer evaluation, along with annotated datasets in English and German for benchmarking.
Findings
Semantic metrics correlate better with human judgment than lexical metrics.
SAS outperforms existing similarity metrics in evaluation tasks.
The authors created and released an English and a German dataset of human-annotated answer pairs.
Abstract
The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In this short paper, we present SAS, a cross-encoder-based metric for the estimation of semantic answer similarity, and compare it to seven existing metrics. To this end, we create an English and a German three-way annotated evaluation dataset containing pairs of answers along with human judgment of their semantic similarity,…
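To make the limitation of lexical comparison concrete, the following is a minimal sketch of two common string-based scores, exact match and token-level F1, applied to answer pairs that mean the same thing. It is not the SAS metric and not code from the paper: the function names (`normalize`, `exact_match`, `token_f1`), the normalization rules, and the example answer pairs are all assumptions chosen for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization rules are an assumption for this sketch.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens, true_tokens = normalize(prediction), normalize(truth)
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answer pairs: (model prediction, ground-truth annotation).
    pairs = [
        ("1,000,000", "one million"),
        ("the United States", "United States of America"),
    ]
    for prediction, truth in pairs:
        em = exact_match(prediction, truth)
        f1 = token_f1(prediction, truth)
        print(f"{prediction!r} vs {truth!r}: EM={em:.2f}, F1={f1:.2f}")
```

In this sketch the first pair shares no tokens, so both scores are zero even though the two answers are equivalent. The second pair gets only partial credit. A semantic metric is meant to score such pairs by meaning rather than by shared strings.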
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
