Utility-Theoretic Ranking for Semi-Automated Text Classification

Giacomo Berardi; Andrea Esuli; Fabrizio Sebastiani

arXiv:1503.00491 · cs.LG · September 21, 2021

Open Access

TL;DR

This paper introduces utility-theoretic ranking methods for semi-automated text classification. The methods order automatically labelled documents so that human validation of the top-ranked ones yields the largest expected error reduction, and they outperform simple confidence-based ranking.

Contribution

The paper develops novel utility-theoretic ranking techniques based on validation gain and proposes a new effectiveness measure for SATC ranking methods.

Findings

01

Utility-theoretic methods outperform baseline confidence-based ranking.

02

Proposed measures effectively predict error reduction.

03

Experiments show significant improvements in classification accuracy.

Abstract

Semi-Automated Text Classification (SATC) may be defined as the task of ranking a set $D$ of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of $D$ with the goal of increasing the overall labelling accuracy of $D$, the expected increase is maximized. An obvious SATC strategy is to rank $D$ so that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for…
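
The contrast between the two ranking strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the `Doc` class, the function names, and the `gain_fp` / `gain_fn` weights are hypothetical, and the sketch assumes that the classifier's confidence is a calibrated probability that the assigned label is correct. Under those assumptions, the baseline ranks documents by ascending confidence, while the utility-style ranking weights the probability of a mislabel by an assumed gain from correcting that kind of error.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    confidence: float         # ASSUMPTION: calibrated P(assigned label is correct)


def rank_by_confidence(docs):
    """Baseline strategy: least confident documents are ranked first."""
    return sorted(docs, key=lambda d: d.confidence)


def expected_gain(doc, gain_fp, gain_fn):
    """Probability of a mislabel times the assumed gain of correcting it.

    gain_fp / gain_fn are hypothetical weights for correcting a false
    positive or a false negative; they are illustrative, not from the paper.
    """
    p_error = 1.0 - doc.confidence
    gain = gain_fp if doc.predicted_positive else gain_fn
    return p_error * gain


def rank_by_expected_gain(docs, gain_fp, gain_fn):
    """Utility-style strategy: largest expected gain is ranked first."""
    return sorted(
        docs,
        key=lambda d: expected_gain(d, gain_fp, gain_fn),
        reverse=True,
    )


# Toy data, invented for illustration only.
docs = [
    Doc("a", True, 0.60),
    Doc("b", False, 0.70),
    Doc("c", True, 0.90),
]

baseline = [d.doc_id for d in rank_by_confidence(docs)]
utility = [d.doc_id for d in rank_by_expected_gain(docs, gain_fp=1.0, gain_fn=2.0)]

print(baseline)
print(utility)
```

In this toy setting the two rankings disagree whenever a more confident label carries a larger assumed correction gain, which is the kind of case in which a purely confidence-based ordering may be suboptimal.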

Taxonomy

Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Imbalanced Data Classification Techniques