Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta, Candace Ross, David Pantoja, Rebecca J. Passonneau, Megan Ung, Adina Williams

TL;DR
This paper introduces SMART filtering, a method to improve NLP model evaluation by selecting high-quality, challenging, and diverse test examples, reducing dataset size while maintaining ranking accuracy.
Contribution
The paper presents SMART filtering, a novel systematic approach to select high-quality benchmark examples, addressing saturation, contamination, and diversity issues in NLP evaluation datasets.
Findings
Reduces dataset size by 48% on average
Increases correlation with human rankings
Enhances evaluation efficiency and robustness
Abstract
One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with…
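The three filtering criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the dictionary keys (`is_easy`, `is_contaminated`, `embedding`), the use of cosine distance, the greedy keep-first strategy, and the `min_distance` threshold are all assumptions made for illustration. How an example is judged easy or contaminated is not specified in the text above, so those judgments are taken as precomputed flags.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed distance metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def smart_filter_sketch(examples, min_distance=0.1):
    """Keep examples that pass three checks, in the order given in the abstract.

    `examples` is a list of dicts with hypothetical keys:
      'id'              -- any identifier
      'is_easy'         -- precomputed flag (criterion i)
      'is_contaminated' -- precomputed flag (criterion ii)
      'embedding'       -- list of floats (criterion iii)
    `min_distance` is an illustrative threshold, not a value from the paper.
    """
    kept = []
    for ex in examples:
        if ex["is_easy"]:
            continue  # (i) drop easy examples
        if ex["is_contaminated"]:
            continue  # (ii) drop data-contaminated examples
        # (iii) drop examples too close to one that was already kept
        too_similar = any(
            cosine_distance(ex["embedding"], k["embedding"]) < min_distance
            for k in kept
        )
        if too_similar:
            continue
        kept.append(ex)
    return kept


# Toy data, invented purely to exercise the function.
toy_examples = [
    {"id": "a", "is_easy": True, "is_contaminated": False, "embedding": [1.0, 0.0]},
    {"id": "b", "is_easy": False, "is_contaminated": True, "embedding": [0.0, 1.0]},
    {"id": "c", "is_easy": False, "is_contaminated": False, "embedding": [1.0, 0.0]},
    {"id": "d", "is_easy": False, "is_contaminated": False, "embedding": [0.999, 0.01]},
    {"id": "e", "is_easy": False, "is_contaminated": False, "embedding": [0.0, 1.0]},
]
kept_ids = [ex["id"] for ex in smart_filter_sketch(toy_examples)]
```

In this toy run, `a` is dropped as easy, `b` as contaminated, and `d` as a near-duplicate of `c`. The greedy pass keeps whichever of two similar examples appears first. Other tie-breaking rules are equally plausible.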
Taxonomy
Topics: Neural Networks and Applications
