CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses, Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação

Washington Cunha; Leonardo Rocha; Marcos André Gonçalves

arXiv:2506.07169·cs.CL·June 10, 2025

TL;DR

This paper explores instance selection techniques for automatic text classification, proposing new methods that significantly reduce training data size and computational costs while maintaining model effectiveness, especially for large datasets and transformer models.

Contribution

It provides a comprehensive comparison of existing instance selection methods and introduces two novel noise- and redundancy-aware solutions tailored for large NLP datasets and transformer architectures.

Findings

01

Achieved an average 41% reduction in training data size without loss of effectiveness.

02

Demonstrated training-time speedups of up to 2.46x.

03

Confirmed the untapped potential of instance selection in NLP tasks.

Abstract

Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, and more complexity, best exemplified by Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. This Ph.D. dissertation focuses on an under-investigated NLP data engineering technique known as Instance Selection (IS), whose potential is enormous in the current scenario. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the cost of the training process. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task, Automatic Text Classification (ATC), considering several classification solutions and…
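To make the idea of removing noisy or redundant instances concrete, the following is a minimal Python sketch of a generic instance selection pass. It is not one of the methods proposed or compared in the dissertation. The function name `select_instances`, the Jaccard token-overlap similarity, the nearest-neighbour label-disagreement rule for noise, and the parameters `redundancy_threshold` and `k` are all assumptions chosen purely for illustration.

```python
def tokens(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_instances(texts, labels, redundancy_threshold=0.8, k=3):
    """Return the indices of training instances kept after two filters.

    Both filters are illustrative assumptions, not the paper's methods.
    """
    token_sets = [tokens(t) for t in texts]
    n = len(texts)

    # Noise filter (assumed rule): flag an instance when most of its
    # k most similar neighbours carry a different label.
    noisy = set()
    for i in range(n):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: jaccard(token_sets[i], token_sets[j]),
            reverse=True,
        )[:k]
        disagree = sum(labels[j] != labels[i] for j in neighbours)
        if neighbours and disagree > len(neighbours) / 2:
            noisy.add(i)

    # Redundancy filter (assumed rule): skip an instance that is nearly
    # identical to an already kept instance with the same label.
    kept = []
    for i in range(n):
        if i in noisy:
            continue
        redundant = any(
            labels[j] == labels[i]
            and jaccard(token_sets[i], token_sets[j]) >= redundancy_threshold
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept


if __name__ == "__main__":
    texts = [
        "good great fun film",
        "good great fun film",   # exact duplicate of the first instance
        "good great fun story",
        "bad awful dull film",
        "bad awful dull story",
        "bad awful dull plot",
        "bad awful dull movie",  # labelled positive: likely label noise
    ]
    labels = ["pos", "pos", "pos", "neg", "neg", "neg", "pos"]
    kept = select_instances(texts, labels)
    print(f"kept {len(kept)} of {len(texts)} instances: {kept}")
```

On this toy data the duplicate and the mislabelled instance are dropped, so a classifier would be trained on five instances rather than seven. The pairwise comparison is quadratic in the number of instances, which is one reason a method aimed at large datasets would need something more scalable than this sketch.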
