Hardness of Samples Need to be Quantified for a Reliable Evaluation System: Exploring Potential Opportunities with a New Task

Swaroop Mishra; Anjana Arunkumar; Chris Bryan; Chitta Baral

arXiv:2210.07631·cs.CL·October 17, 2022·1 cites

TL;DR

This paper introduces Data Scoring, a task that assigns each benchmark sample a hardness score without model supervision, with the aim of making evaluation more reliable. It proposes an STS-based scoring method, validates it against the accuracy of existing models, and demonstrates five applications.

Contribution

A novel Data Scoring task that estimates sample difficulty without model supervision, reducing bias and enhancing evaluation accuracy in AI benchmarks.

Findings

01. Existing models perform better on easier samples

02. Proposed STS-based method effectively predicts sample difficulty

03. Five applications demonstrate practical utility

Abstract

Evaluating models on benchmarks is unreliable without knowing the degree of sample hardness; this leads to overestimating the capability of AI systems and limits their adoption in real-world applications. We propose a Data Scoring task that requires assigning each unannotated sample in a benchmark a score between 0 and 1, where 0 signifies easy and 1 signifies hard. The use of unannotated samples in our task design is inspired by humans, who can determine a question's difficulty without knowing its correct answer. This also rules out methods involving model-based supervision (since they require sample annotations for training), eliminating potential biases associated with models in deciding sample difficulty. We propose a method based on Semantic Textual Similarity (STS) for this task; we validate our method by showing that existing models are more accurate with respect…
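The validation idea in the abstract (checking whether models are more accurate on samples scored as easier) can be illustrated with a small sketch. This is not the authors' code: the function name `accuracy_by_hardness`, the equal-width binning, and the toy data are all assumptions made for illustration, and the hardness scores are taken as given rather than computed with the paper's STS-based method.

```python
from typing import List, Optional, Sequence


def accuracy_by_hardness(
    scores: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 2,
) -> List[Optional[float]]:
    """Return model accuracy within equal-width hardness bins over [0, 1].

    scores  -- hypothetical hardness scores (0 = easy, 1 = hard)
    correct -- whether some model answered each sample correctly
    Bins with no samples are reported as None.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, is_correct in zip(scores, correct):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        totals[idx] += 1
        hits[idx] += int(is_correct)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Made-up scores and correctness flags, for illustration only.
toy_scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
toy_correct = [True, True, True, False, True, False]
print(accuracy_by_hardness(toy_scores, toy_correct, n_bins=2))
```

If a scoring method tracks difficulty well, accuracy would be expected to fall from the first (easy) bin to the last (hard) bin, as it does in this toy data.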

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Software Engineering Research