SCARV: Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets
Xu Zheng, Feiyu Wu, Linhong Wu, Zhuocheng Wang, Hui Li

TL;DR
SCARV is a modular framework that enhances the stability of sample rankings in redundant NLP datasets by combining multi-seed aggregation with structure-aware clustering, improving reproducibility of data-centric NLP tasks.
Contribution
The paper introduces SCARV, an aggregation method that improves the stability and reproducibility of sample rankings in NLP datasets with redundancy, outperforming rankings taken directly from a single scoring proxy.
Findings
SCARV significantly improves ranking stability across random seeds on several NLP tasks and datasets.
Multi-seed aggregation is the key stabilizer in the proposed framework.
Structure-aware aggregation adds value under low budgets or with informative redundancy clusters.
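One way to make "stability" in these findings concrete is to compare the rankings produced by different training seeds. The sketch below assumes stability is measured as the mean pairwise Spearman rank correlation between per-seed rankings. The summary does not state which metric the paper uses, so the metric choice and the function names `ranks` and `mean_pairwise_spearman` are illustrative assumptions.

```python
from itertools import combinations

import numpy as np


def ranks(scores):
    """Return 0-based ranks of a 1-D score array (ties broken by position)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")
    r = np.empty(len(scores), dtype=float)
    r[order] = np.arange(len(scores))
    return r


def mean_pairwise_spearman(score_matrix):
    """Average Spearman correlation over all pairs of seeds.

    score_matrix has shape (n_seeds, n_samples): one row of proxy scores
    per training seed. This metric is an assumption for illustration,
    not necessarily the one reported in the paper.
    """
    rank_matrix = np.array([ranks(row) for row in score_matrix])
    corrs = [
        np.corrcoef(rank_matrix[i], rank_matrix[j])[0, 1]
        for i, j in combinations(range(len(rank_matrix)), 2)
    ]
    return float(np.mean(corrs))
```

A value near 1 means the seeds agree on the ordering of samples. Values near 0 or below mean the ordering is largely seed-dependent.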
Abstract
Sample-level rankings are increasingly used in data-centric NLP for analysis, filtering, debugging, and curation, yet existing pipelines typically score training examples pointwise and rank them as if they were independent. This assumption is fragile in the presence of exact duplicates, near-duplicates, paraphrases, and other redundant structure common in NLP corpora, where stochastic training can make highly similar examples receive unstable relative orderings across random seeds. We study stable sample-level ranking under redundancy and propose SCARV, a modular aggregation framework that operates on top of an existing scoring proxy. SCARV combines robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. Across synthetic redundancy, naturally mined QQP redundancy, multiple proxy families, several NLP tasks, and…
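The two stages named in the abstract can be illustrated with a minimal sketch. It assumes the robust multi-seed aggregator is a per-sample median and that the structure-aware step assigns every member of a redundancy cluster the mean score of that cluster. The abstract does not specify either rule, so both choices, along with the names `aggregate_over_seeds` and `aggregate_over_clusters`, are hypothetical stand-ins rather than the authors' method.

```python
import numpy as np


def aggregate_over_seeds(score_matrix):
    """Collapse (n_seeds, n_samples) proxy scores to one score per sample.

    The median is an assumed robust aggregator, not confirmed by the source.
    """
    return np.median(np.asarray(score_matrix, dtype=float), axis=0)


def aggregate_over_clusters(scores, cluster_ids):
    """Give each sample the mean score of its redundancy cluster.

    cluster_ids is a 1-D array with one cluster label per sample
    (for example, from near-duplicate detection). Cluster-mean sharing
    is an assumed rule for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    _, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    sums = np.bincount(inverse, weights=scores)
    counts = np.bincount(inverse)
    return (sums / counts)[inverse]


# Three seeds, three samples; samples 0 and 1 are near-duplicates.
per_seed = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.3],
    [0.6, 0.4, 0.3],
])
seed_scores = aggregate_over_seeds(per_seed)
final_scores = aggregate_over_clusters(seed_scores, [0, 0, 1])
```

In this toy input the two near-duplicates swap order between seeds. After both steps they share a single score, so their relative order can no longer flip from one seed to the next.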
