Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

Yoo Yeon Sung; Maharshi Gor; Eve Fleisig; Ishani Mondal; Jordan Lee Boyd-Graber

arXiv:2406.16342·cs.CL·February 20, 2025

TL;DR

This paper introduces AdvScore, a human-grounded metric for evaluating how adversarial a dataset is, uses it to build AdvQA, a high-quality adversarial question answering dataset, and tracks model progress over five years (2020 to 2024).

Contribution

The paper proposes AdvScore, a standardized metric for measuring dataset adversarialness, and demonstrates its use in creating and evaluating a new adversarial QA dataset, AdvQA.

Findings

1. AdvScore effectively tracks model improvement over time.
2. AdvQA provides high-quality, realistic adversarial samples.
3. AdvScore helps ensure datasets remain challenging and relevant.

Abstract

Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose AdvScore, a human-grounded evaluation metric that assesses a dataset's adversarialness by capturing models' and humans' varying abilities while also identifying poor examples. We then use AdvScore to motivate a new dataset creation pipeline for realistic and high-quality adversarial samples, enabling us to collect an adversarial question answering (QA) dataset, AdvQA. We apply AdvScore using 9,347 human responses and ten language models' predictions to track model improvement over five years, from 2020 to 2024. AdvScore thus provides guidance for achieving robustness…

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning