Improving Model Evaluation using SMART Filtering of Benchmark Datasets

Vipul Gupta; Candace Ross; David Pantoja; Rebecca J. Passonneau; Megan Ung; Adina Williams

arXiv:2410.20245·cs.CL·February 12, 2025

TL;DR

This paper introduces SMART filtering, a method to improve NLP model evaluation by selecting high-quality, challenging, and diverse test examples, reducing dataset size while maintaining ranking accuracy.

Contribution

The paper presents SMART filtering, a novel systematic approach to select high-quality benchmark examples, addressing saturation, contamination, and diversity issues in NLP evaluation datasets.

Findings

1. Reduces dataset size by 48% on average
2. Increases correlation with human rankings
3. Enhances evaluation efficiency and robustness

Abstract

One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with…
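
The three filtering criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smart_filter_sketch`, the precomputed `easy` and `contaminated` flags, the greedy de-duplication order, and the `min_distance` threshold are all assumptions made for illustration. How an example is judged easy or contaminated is left outside the sketch, since the abstract does not specify it.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def smart_filter_sketch(
    easy: Sequence[bool],
    contaminated: Sequence[bool],
    embeddings: Sequence[Sequence[float]],
    min_distance: float = 0.1,  # hypothetical threshold, not taken from the paper
) -> List[int]:
    """Return indices of examples that survive all three filters.

    easy[i] and contaminated[i] are assumed to be computed elsewhere.
    Similar examples are removed greedily: an example is dropped if it lies
    closer than min_distance to any example that has already been kept.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        # Criteria (i) and (ii): drop easy or data-contaminated examples.
        if easy[i] or contaminated[i]:
            continue
        # Criterion (iii): drop examples too close to an already kept one.
        if any(cosine_distance(emb, embeddings[j]) < min_distance for j in kept):
            continue
        kept.append(i)
    return kept


# Toy data: example 0 is easy, 1 is contaminated, 3 nearly duplicates 2.
toy_easy = [True, False, False, False, False]
toy_contaminated = [False, True, False, False, False]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
toy_kept = smart_filter_sketch(toy_easy, toy_contaminated, toy_embeddings)
print(toy_kept)
```

The greedy pass depends on example order, so a different ordering may keep a different representative from each cluster of near-duplicates; that is a simplification of this sketch rather than a property claimed by the paper.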

Code & Models

Repositories

facebookresearch/responsiblenlp

Taxonomy

Topics: Neural Networks and Applications