Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification

Sareh Ahmadi; Aditya Shah; Edward Fox

arXiv:2307.14899·cs.CL·November 13, 2023

Open Access

TL;DR

This paper proposes a retrieval-based method using SHAP and semantic search to select texts for annotation, improving classifier performance on imbalanced binary classification tasks in health-related datasets.

Contribution

It introduces a novel approach combining SHAP, vector search, and semantic search for effective text selection in class-imbalanced scenarios, with demonstrated improvements.

Findings

01 Improved F1 scores for minority classes.

02 Effective selection of annotation sets.

03 Enhanced classifier performance on imbalanced data.

Abstract

This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when there are limits on the number of annotations due to constraints on human resources. An additional challenge addressed is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance. In our situation, where annotation occurs over a long time period, the selection of texts to be annotated can be made in batches, with previous annotations guiding the choice of the next set. To address these challenges, the paper proposes leveraging SHAP to construct a quality set of queries for Elasticsearch and semantic search, to try to identify optimal sets of texts for annotation that will help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants…
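The batch-wise loop described in the abstract (score terms on the texts annotated so far, turn the strongest positive-class terms into a query, retrieve the next batch to annotate) can be illustrated with a minimal, stdlib-only sketch. This is not the authors' implementation. As loudly labeled assumptions: a smoothed log-odds score stands in for SHAP attributions, a simple term-overlap ranking stands in for Elasticsearch and semantic search, and all function names and toy texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenizer; keeps purely alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]


def term_scores(labeled):
    # ASSUMPTION: smoothed log-odds of document frequency (positive vs.
    # negative class) is used here as a stand-in for SHAP attributions.
    pos_docs = [set(tokenize(t)) for t, y in labeled if y == 1]
    neg_docs = [set(tokenize(t)) for t, y in labeled if y == 0]
    pos_df = Counter(w for d in pos_docs for w in d)
    neg_df = Counter(w for d in neg_docs for w in d)
    scores = {}
    for w in set(pos_df) | set(neg_df):
        p = (pos_df[w] + 1) / (len(pos_docs) + 2)
        q = (neg_df[w] + 1) / (len(neg_docs) + 2)
        scores[w] = math.log(p / q)
    return scores


def build_query(scores, k=2):
    # Keep the k terms most indicative of the positive (minority) class.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, s in ranked[:k] if s > 0]


def retrieve(query, pool, batch_size=2):
    # ASSUMPTION: term-overlap ranking stands in for a real search backend.
    scored = []
    for idx, text in enumerate(pool):
        overlap = len(set(tokenize(text)) & set(query))
        if overlap > 0:
            scored.append((overlap, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:batch_size]]


# Hypothetical toy data: label 1 is the rare positive category.
labeled = [
    ("i will visit the doctor for my back pain", 1),
    ("my knee pain will need a doctor soon", 1),
    ("i will go to the beach with friends", 0),
    ("my friends will cook dinner tonight", 0),
    ("i will watch a movie tonight", 0),
]
pool = [
    "the doctor said my pain is normal",
    "we will go to the movies",
    "chest pain worries me",
    "dinner with my family",
]

query = build_query(term_scores(labeled), k=2)
batch = retrieve(query, pool, batch_size=2)  # pool indices to annotate next
```

In a batch setting, the texts at the returned indices would be annotated, appended to `labeled`, and the scoring and retrieval steps repeated, so that earlier annotations guide the next selection.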

Taxonomy

Topics: Text and Document Classification Technologies · Imbalanced Data Classification Techniques

Methods: Shapley Additive Explanations