Utility-Theoretic Ranking for Semi-Automated Text Classification
Giacomo Berardi, Andrea Esuli, Fabrizio Sebastiani

TL;DR
This paper introduces utility-theoretic ranking methods for semi-automated text classification. The methods order automatically labelled documents so that human validation of the top-ranked ones yields the largest expected error reduction, and they outperform simple confidence-based ranking.
Contribution
It develops novel utility-theoretic ranking techniques based on validation gain and proposes a new effectiveness measure for SATC ranking methods.
Findings
Utility-theoretic methods outperform baseline confidence-based ranking.
Proposed measures effectively predict error reduction.
Experiments show significant improvements in classification accuracy.
Abstract
\emph{Semi-Automated Text Classification} (SATC) may be defined as the task of ranking a set $\mathcal{D}$ of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of $\mathcal{D}$ with the goal of increasing the overall labelling accuracy of $\mathcal{D}$, the expected increase is maximized. An obvious SATC strategy is to rank $\mathcal{D}$ so that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of \emph{validation gain}, defined as the improvement in classification effectiveness that would derive by validating a given automatically labelled document. We also propose a new effectiveness measure for…
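The contrast the abstract draws between confidence-based and utility-theoretic ranking can be illustrated with a small sketch. This is not the authors' formulation, which the truncated abstract does not spell out. It assumes that the probability of a label being wrong is simply one minus the classifier's confidence, and that correcting a false positive and correcting a false negative carry fixed, different gains. The names `LabelledDoc`, `gain_fp`, and `gain_fn` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabelledDoc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    confidence: float         # classifier confidence in that label, in [0, 1]


def rank_by_low_confidence(docs):
    """Baseline strategy: least confident documents first."""
    return sorted(docs, key=lambda d: d.confidence)


def expected_gain(doc, gain_fp, gain_fn):
    """Assumed utility: P(label is wrong) times the gain from correcting it.

    A wrong positive label is a false positive; a wrong negative label is a
    false negative. The two gains are illustrative constants, not values
    taken from the paper.
    """
    p_error = 1.0 - doc.confidence
    gain = gain_fp if doc.predicted_positive else gain_fn
    return p_error * gain


def rank_by_expected_gain(docs, gain_fp, gain_fn):
    """Utility-style strategy: highest expected gain first."""
    return sorted(
        docs,
        key=lambda d: expected_gain(d, gain_fp, gain_fn),
        reverse=True,
    )


docs = [
    LabelledDoc("a", predicted_positive=True, confidence=0.60),
    LabelledDoc("b", predicted_positive=False, confidence=0.70),
    LabelledDoc("c", predicted_positive=True, confidence=0.90),
]

baseline = [d.doc_id for d in rank_by_low_confidence(docs)]
utility = [d.doc_id for d in rank_by_expected_gain(docs, gain_fp=1.0, gain_fn=3.0)]
print(baseline)
print(utility)
```

With these made-up gains, document "b" moves ahead of "a" even though the classifier is more confident about it, because correcting a false negative is assumed to be worth more. This is the kind of reordering a confidence-only ranking cannot produce.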
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques
