Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Lukáš Mikula, Michal Štefánik, Marek Petrovič, Petr Sojka

arXiv:2305.06841 · cs.CL · February 7, 2024 · 1 citation

TL;DR

This paper introduces a method for measuring how much question answering models depend on spurious dataset biases. It finds that current debiasing techniques do not fully address biases shared across datasets, which weakens robustness evaluation on out-of-distribution data.

Contribution

The authors propose a simple, scalable method to quantify models' reliance on known and new spurious features in QA, highlighting shared biases and limitations of existing debiasing approaches.

Findings

1. Debiasing methods reduce reliance on targeted spurious features.
2. OOD performance improvements are not solely due to bias mitigation.
3. Models trained on different datasets depend on similar bias features.

Abstract

While Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the biases of the training dataset. We propose a simple method for measuring the scale of models' reliance on any identified spurious feature and assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features, suggesting that…
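The abstract does not spell out the measurement itself. As a minimal sketch, assume reliance on a spurious feature can be read off as the accuracy gap between evaluation examples where the feature holds and those where it does not. The function names, the toy data, and the `answer_in_first_sentence` heuristic are illustrative assumptions, not the authors' code from mir-mu/isbiased.

```python
from typing import Callable, Dict, Sequence


def reliance_gap(
    examples: Sequence[Dict[str, str]],
    correct: Sequence[bool],
    has_feature: Callable[[Dict[str, str]], bool],
) -> float:
    """Accuracy where the feature holds minus accuracy where it does not.

    A large positive gap suggests the model does better when the
    (hypothetical) shortcut is available, i.e. that it may rely on it.
    """
    with_feature = [c for ex, c in zip(examples, correct) if has_feature(ex)]
    without_feature = [c for ex, c in zip(examples, correct) if not has_feature(ex)]
    if not with_feature or not without_feature:
        raise ValueError("the feature must split the data into two non-empty groups")
    acc_with = sum(with_feature) / len(with_feature)
    acc_without = sum(without_feature) / len(without_feature)
    return acc_with - acc_without


def answer_in_first_sentence(example: Dict[str, str]) -> bool:
    """Hypothetical spurious feature: the answer occurs in the first sentence."""
    first_sentence = example["context"].split(".")[0]
    return example["answer"] in first_sentence


# Toy evaluation set (made up for illustration only).
examples = [
    {"context": "Paris is the capital of France. It lies on the Seine.", "answer": "Paris"},
    {"context": "Brno is in Moravia. It hosts a university.", "answer": "Moravia"},
    {"context": "The river is long. It flows through Vienna.", "answer": "Vienna"},
    {"context": "The team met early. They later moved to Prague.", "answer": "Prague"},
]
# Whether a hypothetical QA model answered each example correctly.
correct = [True, True, True, False]

gap = reliance_gap(examples, correct, answer_in_first_sentence)
print(f"accuracy gap attributed to the feature: {gap:.2f}")
```

Under this assumption, a debiasing method that works as intended should shrink the gap for the feature it targets. Comparing gaps across several features and training datasets would indicate whether biases are shared.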

Code & Models

Repositories

mir-mu/isbiased (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications