Identifying Semantically Difficult Samples to Improve Text Classification

Shashank Mujumdar; Stuti Mehta; Hima Patel; Suman Mitra

arXiv:2302.06155 · cs.CL · February 14, 2023

TL;DR

This paper proposes a method to identify and address semantically difficult samples in text datasets, improving classification accuracy by up to 9% through a novel difficulty scoring approach based on semantic embeddings.

Contribution

It introduces a new difficulty scoring function for text samples based on semantic similarity and dissimilarity, enhancing model performance by focusing on challenging data points.

Findings

01. Improvement of up to 9% in experiments on 13 standard datasets

02. Effective identification of semantically ambiguous samples

03. Qualitative analysis confirms the approach's usefulness

Abstract

In this paper, we investigate the effect of addressing difficult samples in a given text dataset on the downstream text classification task. We define difficult samples as non-obvious cases for text classification, identified by analysing them in the semantic embedding space; specifically, (i) semantically similar samples that belong to different classes and (ii) semantically dissimilar samples that belong to the same class. We propose a penalty function to measure the overall difficulty score of every sample in the dataset. We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9%, and we discuss qualitative results to show the effectiveness of our approach in identifying difficult samples for a text classification model.
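
The two notions of difficulty in the abstract can be illustrated with a minimal sketch. This is not the paper's penalty function, which the abstract does not spell out; it is an assumed stand-in that uses cosine similarity over precomputed embeddings. The function name `difficulty_scores` and the choice to add the two penalty terms with equal weight are hypothetical.

```python
import numpy as np

def difficulty_scores(embeddings, labels):
    """Illustrative difficulty score per sample (assumed form, not the paper's).

    score_i = mean cosine similarity to samples of OTHER classes   (case i)
            + mean cosine distance to samples of the SAME class    (case ii)
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # L2-normalise rows so that the dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T

    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)          # a sample is not its own neighbour
    other = y[:, None] != y[None, :]

    scores = np.zeros(len(y))
    for i in range(len(y)):
        # (i) high similarity to other-class samples raises the score
        inter = sim[i, other[i]].mean() if other[i].any() else 0.0
        # (ii) low similarity to same-class samples raises the score
        intra = (1.0 - sim[i, same[i]]).mean() if same[i].any() else 0.0
        scores[i] = inter + intra
    return scores
```

On a toy input where one class-0 sample sits among the class-1 samples, that sample receives the largest score. In practice the embeddings would come from a sentence encoder, and how the high-scoring samples are then handled is a separate design decision that this sketch does not cover.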

Taxonomy

Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Topic Modeling