What do we expect from Multiple-choice QA Systems?
Krunal Shah, Nitish Gupta, Dan Roth

TL;DR
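This paper evaluates a top-performing MCQA model against the expectations one might have of a system that "understands" language, using zero-information perturbations of its inputs. The model falls short of those expectations, which motivates a modified training approach that forces it to attend more closely to its inputs.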
Contribution
It introduces an evaluation approach for MCQA models based on zero-information perturbations of the inputs, and proposes a modified training paradigm that pushes the model to attend to its inputs and better meet the stated expectations.
Findings
The original model falls short of expectations under zero-information perturbations.
Modified training improves model attention without sacrificing performance.
Models trained with the new paradigm better satisfy human-like expectations.
Abstract
The recent success of machine learning systems on various QA datasets could be interpreted as a significant improvement in models' language understanding abilities. However, using various perturbations, multiple recent works have shown that good performance on a dataset might not indicate performance that correlates well with humans' expectations of models that "understand" language. In this work we consider a top-performing model on several Multiple Choice Question Answering (MCQA) datasets, and evaluate it against a set of expectations one might have of such a model, using a series of zero-information perturbations of the model's inputs. Our results show that the model clearly falls short of our expectations, and motivate a modified training approach that forces the model to better attend to the inputs. We show that the new training paradigm leads to a model that performs on par…
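To make the evaluation idea concrete, here is a minimal sketch of a perturbation-based check. It is an illustration under stated assumptions, not the authors' code: the `MCQAExample` type, the `blank_question` perturbation, the `toy_scorer` stand-in model, and the expectation that confidence should drop when the question is removed are all hypothetical choices made for this example. The abstract does not list the paper's actual perturbations.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class MCQAExample:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: Tuple[str, ...]


# A scorer maps an example to one probability per answer option.
Scorer = Callable[[MCQAExample], List[float]]


def blank_question(example: MCQAExample) -> MCQAExample:
    """Assumed perturbation: remove the question, keep the options."""
    return replace(example, question="")


def toy_scorer(example: MCQAExample) -> List[float]:
    """Stand-in for a real model: scores options by word overlap with the question."""
    question_words = set(example.question.lower().split())
    # Add-one smoothing so every option receives non-zero probability.
    raw = [1 + len(question_words & set(opt.lower().split()))
           for opt in example.options]
    total = sum(raw)
    return [r / total for r in raw]


def confidence_drop(scorer: Scorer,
                    example: MCQAExample,
                    perturb: Callable[[MCQAExample], MCQAExample]) -> float:
    """Top-option confidence on the original input minus that on the perturbed input."""
    return max(scorer(example)) - max(scorer(perturb(example)))


if __name__ == "__main__":
    ex = MCQAExample(
        question="What color is the clear daytime sky",
        options=("blue sky", "green grass", "red brick"),
    )
    # A positive drop means the scorer became less confident without the question.
    print(confidence_drop(toy_scorer, ex, blank_question))
```

With a real model in place of `toy_scorer`, a drop near zero would suggest that the model's prediction barely depends on the question, which is the kind of mismatch with expectations the abstract describes.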
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
