Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Fei Chen; Wenchi Zhou

arXiv:2507.00038 · cs.LG · August 11, 2025

TL;DR

This paper introduces a data reduction method based on Pointwise V-Information (PVI) that maintains classifier accuracy while reducing dataset size, accelerates training, and is adaptable across languages and NLP tasks.

Contribution

It proposes a novel PVI-based data reduction strategy that effectively selects instructive data, improving training efficiency and performance, and extends PVI applicability to Chinese NLP tasks.

Findings

01

Maintains classifier accuracy, with only a 0.0001%-0.76% decline, when 10%-30% of the data is removed

02

Progressive learning on examples sorted by increasing PVI accelerates convergence and yields a 0.8% accuracy gain

03

Adapts PVI for Chinese NLP tasks

Abstract

In order to increase the effectiveness of model training, data reduction is essential to data-centric Artificial Intelligence (AI). It achieves this by locating the most instructive examples in massive datasets. To increase data quality and training efficiency, the main difficulty is choosing the best examples rather than the complete datasets. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional…
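
The two steps described in the abstract (static filtering by PVI, then progressive training in PVI order) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. It assumes the standard PVI definition from the V-information literature, PVI(x -> y) = -log2 g'[null](y) + log2 g[x](y), where g is a model fine-tuned on real inputs and g' is fine-tuned on empty inputs. It also assumes that high PVI corresponds to low difficulty, so the highest-PVI instances are the ones removed. The function names `pvi`, `keep_indices`, and `progressive_order` are hypothetical, and the probabilities are made-up toy values rather than results from the paper.

```python
import math
from typing import List, Sequence


def pvi(p_with_input: float, p_null: float) -> float:
    """PVI in bits: log2 g[x](y) - log2 g'[null](y).

    p_with_input: probability of the gold label from the model that sees x.
    p_null: probability of the gold label from the model that sees an empty input.
    """
    return math.log2(p_with_input) - math.log2(p_null)


def keep_indices(pvis: Sequence[float], remove_fraction: float) -> List[int]:
    """Drop the highest-PVI (assumed easiest) fraction; return kept indices."""
    n_remove = int(len(pvis) * remove_fraction)
    ascending = sorted(range(len(pvis)), key=lambda i: pvis[i])
    kept = ascending[: len(pvis) - n_remove]
    return sorted(kept)  # restore original dataset order


def progressive_order(pvis: Sequence[float], indices: Sequence[int]) -> List[int]:
    """Order the given indices by increasing PVI, as the abstract describes."""
    return sorted(indices, key=lambda i: pvis[i])


# Toy gold-label probabilities (assumed values, for illustration only).
p_with = [0.9, 0.5, 0.99, 0.2]
p_null = [0.5, 0.5, 0.5, 0.5]

scores = [pvi(a, b) for a, b in zip(p_with, p_null)]
kept = keep_indices(scores, remove_fraction=0.25)
schedule = progressive_order(scores, kept)
print(kept, schedule)
```

In this toy run, the instance with the highest PVI (index 2) is removed, and the remaining instances are scheduled from lowest to highest PVI. In practice the two probabilities would come from two separately fine-tuned classifiers, which this sketch does not cover.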

Taxonomy

Topics: Advanced Data Compression Techniques · Rough Sets and Fuzzy Logic · Statistical Methods and Inference