What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0
Michael Kamfonas, Gabriel Alon

TL;DR
This paper introduces the Golden Rank (GR) and Golden Rank Interpolated Median (GRIM) metrics to analyze secondary predictions in question-answering models, revealing insights into model confidence and question difficulty beyond traditional accuracy measures.
Contribution
The paper proposes new metrics, GR and GRIM, for analyzing secondary predictions in QA models, providing novel insights into model confidence and question difficulty.
Findings
Across the 16 models analyzed, most golden answers found among secondary predictions rank very close to the top.
GRIM is not correlated with F1 or EM scores.
The metrics are useful for error analysis and training diagnostics.
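The summary names GRIM but does not spell out its formula. The sketch below reads "interpolated median" in the standard grouped-data sense. It assumes that each integer Golden Rank is the midpoint of a unit-width class and that GRIM is the interpolated median over some chosen set of examples. Which examples are included is left to the caller (for instance all examples, or only those whose primary prediction failed). The function name `grim` is hypothetical, and this is not presented as the authors' implementation.

```python
import statistics
from typing import Sequence


def grim(golden_ranks: Sequence[int]) -> float:
    """Hypothetical sketch of GRIM: interpolated median of integer Golden Ranks.

    Assumption: each rank r is the midpoint of the class [r - 0.5, r + 0.5),
    so the result is L + (N/2 - cf) / f, where L is the lower boundary of the
    median rank's class, cf counts ranks below it and f counts ranks equal to it.
    """
    if not golden_ranks:
        raise ValueError("GRIM is undefined for an empty set of examples")
    # statistics.median_grouped applies exactly this grouped-data interpolation.
    return statistics.median_grouped(golden_ranks, interval=1)


# Ranks [0, 0, 1, 1, 1, 2]: the plain median is 1, but interpolation gives
# 0.5 + (3 - 2) / 3, reflecting that the mass of rank 1 starts just past the middle.
print(grim([0, 0, 1, 1, 1, 2]))
```

Unlike the plain median, the interpolated value moves continuously as examples shift between adjacent ranks. That sensitivity is presumably why an interpolated median suits rank data concentrated near the top.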
Abstract
Performance in natural language processing, and specifically on the question-answering task, is typically measured by comparing a model's most confident (primary) prediction to the golden answers (the ground truth). We make the case that it is also useful to quantify how close a model came to predicting a correct answer, even for examples it got wrong. We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth, and show why such a match always exists. For the 16 transformer models we analyzed, most exactly matched golden answers in the secondary prediction space rank very close to the top. Secondary predictions are those ranked above 0 when predictions are sorted by descending confidence probability. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from persistent…
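The GR definition can be sketched as follows, as a minimal illustration under stated assumptions. Predictions are given as (answer text, probability) pairs, ranks are 0-based so that rank 0 is the primary prediction, and "exactly matches" is taken to mean equality after SQuAD-style answer normalization (lowercasing and removing punctuation, the articles a/an/the, and extra whitespace). Following the SQuAD-v2 convention, an unanswerable question has the empty string as its golden answer. The function names are hypothetical. The function returns `None` when a truncated n-best list contains no match; over the full prediction space the abstract states that a match always exists.

```python
import re
import string
from typing import Optional, Sequence, Tuple

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse spaces."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def golden_rank(
    predictions: Sequence[Tuple[str, float]],
    golden_answers: Sequence[str],
) -> Optional[int]:
    """Hypothetical sketch: 0-based rank of the most confident exact match.

    predictions: (answer_text, probability) pairs in any order.
    golden_answers: ground-truth strings; empty or blank means unanswerable,
    in which case the null answer "" is the golden answer.
    """
    golds = {normalize_answer(g) for g in golden_answers if normalize_answer(g)} or {""}
    # Sort by descending confidence; sorted() is stable, so ties keep input order.
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    for rank, (text, _prob) in enumerate(ranked):
        if normalize_answer(text) in golds:
            return rank
    return None  # no match within the supplied (possibly truncated) list


preds = [("Denver Broncos", 0.6), ("the Broncos", 0.2), ("Broncos", 0.1), ("", 0.1)]
print(golden_rank(preds, ["Broncos"]))  # primary prediction fails; GR = 1
```

In this toy example the primary prediction "Denver Broncos" misses the golden answer "Broncos", so exact match scores 0. The GR of 1 still shows that the correct answer sat just one position below the top, which is the kind of near miss the abstract argues is worth quantifying.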
