Quality-Efficiency Trade-offs in Machine Learning for Text Processing
Ricardo Baeza-Yates, Zeinab Liaghat

TL;DR
This paper investigates the trade-offs between data size, training time, and quality in supervised machine learning for text processing, proposing a framework and analyzing three key tasks with large datasets.
Contribution
It introduces a performance trade-off framework for text processing tasks and evaluates how data size impacts algorithm performance and quality across multiple problems.
Findings
On large data, most algorithms reach similar quality, and the best one is usually among the fastest.
For small data, algorithm choice significantly affects quality.
Quality gains diminish as data size increases.
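One way to see the kind of measurement behind these findings is to train the same classifier on growing slices of a dataset and record training time and held-out accuracy for each slice. The following is a minimal sketch under stated assumptions, not the authors' framework: the data is synthetic, the classifier is a small multinomial Naive Bayes written for illustration, and every name in it (VOCAB, NOISE, make_doc, learning_curve, and so on) is hypothetical.

```python
import math
import random
import time
from collections import Counter

# Hypothetical toy vocabulary: each class favors its own words; noise words are shared.
VOCAB = {
    "sports": ["goal", "match", "team", "score"],
    "tech": ["code", "chip", "data", "cloud"],
}
NOISE = ["the", "a", "of", "and", "new", "today"]


def make_doc(label, rng, length=8, signal=0.3):
    """Draw a synthetic document; each token is a class word with probability `signal`."""
    return [
        rng.choice(VOCAB[label]) if rng.random() < signal else rng.choice(NOISE)
        for _ in range(length)
    ]


def make_dataset(n, rng):
    """Return n (tokens, label) pairs with labels drawn uniformly at random."""
    labels = [rng.choice(list(VOCAB)) for _ in range(n)]
    return [(make_doc(y, rng), y) for y in labels]


def train_nb(data):
    """Collect the per-class word counts and document counts Naive Bayes needs."""
    word_counts = {y: Counter() for y in VOCAB}
    class_counts = Counter()
    for tokens, y in data:
        class_counts[y] += 1
        word_counts[y].update(tokens)
    return word_counts, class_counts


def predict_nb(model, tokens):
    """Return the class with the highest Laplace-smoothed log-probability."""
    word_counts, class_counts = model
    vocab_size = len(NOISE) + sum(len(words) for words in VOCAB.values())
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for y in VOCAB:
        total_words = sum(word_counts[y].values())
        score = math.log((class_counts[y] + 1) / (total_docs + len(VOCAB)))
        for t in tokens:
            score += math.log((word_counts[y][t] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = y, score
    return best_label


def learning_curve(sizes, seed=0, n_test=500):
    """Train on the first n documents for each n; return (n, train_seconds, accuracy)."""
    rng = random.Random(seed)
    test = make_dataset(n_test, rng)
    train = make_dataset(max(sizes), rng)
    rows = []
    for n in sizes:
        start = time.perf_counter()
        model = train_nb(train[:n])
        elapsed = time.perf_counter() - start
        accuracy = sum(predict_nb(model, x) == y for x, y in test) / n_test
        rows.append((n, elapsed, accuracy))
    return rows


if __name__ == "__main__":
    for n, secs, acc in learning_curve([10, 100, 1000, 10000]):
        print(f"n={n:>6}  train_time={secs:.4f}s  accuracy={acc:.3f}")
```

Running the script prints one row per training size. Comparing the accuracy column with the time column shows where extra data stops paying for its training cost on this toy problem. Real text corpora and other algorithms may behave differently.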
Abstract
Data mining, machine learning, and natural language processing are powerful techniques that can be used together to extract information from large texts. Depending on the task or problem at hand, there are many different approaches that can be used. The methods available are continuously being optimized, but not all these methods have been tested and compared in a set of problems that can be solved using supervised machine learning algorithms. The question is what happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss by just being able to process more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs involving training…
