Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information
Fei Chen, Wenchi Zhou

TL;DR
This paper introduces a data reduction method based on Pointwise V-Information (PVI) that maintains classifier accuracy while reducing dataset size, accelerates training, and is adaptable across languages and NLP tasks.
Contribution
It proposes a novel PVI-based data reduction strategy that effectively selects instructive data, improving training efficiency and performance, and extends PVI applicability to Chinese NLP tasks.
Findings
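Maintains classifier performance with only a 0.0001%-0.76% decline in accuracy when 10%-30% of the data is removed
Progressive learning on examples sorted by increasing PVI accelerates convergence and yields a 0.8% accuracy gain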
Adapts PVI for Chinese NLP tasks
Abstract
In order to increase the effectiveness of model training, data reduction is essential to data-centric Artificial Intelligence (AI). It achieves this by locating the most instructive examples in massive datasets. To increase data quality and training efficiency, the main difficulty is choosing the best examples rather than the complete datasets. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional…
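The two steps described in the abstract (dropping low-difficulty instances, then ordering the remainder by increasing PVI) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard PVI definition, PVI(x → y) = −log2 g′[∅](y) + log2 g[x](y), where g is a model fine-tuned with inputs and g′ is one fine-tuned on null inputs, and it assumes that "low difficulty" corresponds to high PVI. The function names `pvi` and `reduce_and_order`, the `drop_fraction` parameter, and the toy probabilities are all hypothetical.

```python
import math


def pvi(p_with_input, p_null_input):
    """PVI for one instance, in bits.

    p_with_input: probability the input-aware model assigns to the gold label.
    p_null_input: probability the null-input model assigns to the gold label.
    """
    return math.log2(p_with_input) - math.log2(p_null_input)


def reduce_and_order(scores, drop_fraction):
    """Return indices of kept instances, sorted by increasing PVI.

    Assumption: the easiest instances (highest PVI) are the ones removed.
    """
    n_drop = int(len(scores) * drop_fraction)
    # sorted() is stable, so ties keep their original relative order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[: len(scores) - n_drop]


# Toy gold-label probabilities (made up for illustration only).
p_x = [0.9, 0.5, 0.25, 0.99, 0.6]
p_null = [0.5, 0.5, 0.5, 0.5, 0.5]

scores = [pvi(a, b) for a, b in zip(p_x, p_null)]
kept = reduce_and_order(scores, drop_fraction=0.2)
```

Here `kept` holds the indices of the retained instances in the order a progressive training schedule would consume them. The actual removal thresholds and schedule used in the paper are not given in this summary.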
Taxonomy
Topics: Advanced Data Compression Techniques · Rough Sets and Fuzzy Logic · Statistical Methods and Inference
