Identifying Semantically Difficult Samples to Improve Text Classification
Shashank Mujumdar, Stuti Mehta, Hima Patel, Suman Mitra

TL;DR
This paper proposes a method to identify and address semantically difficult samples in text datasets, improving classification accuracy by up to 9% through a novel difficulty scoring approach based on semantic embeddings.
Contribution
It introduces a new difficulty scoring function for text samples based on semantic similarity and dissimilarity, enhancing model performance by focusing on challenging data points.
Findings
Up to 9% accuracy improvement on 13 datasets
Effective identification of semantically ambiguous samples
Qualitative results illustrate that the scoring picks out samples that are difficult for a text classification model
Abstract
In this paper, we investigate the effect of addressing difficult samples from a given text dataset on the downstream text classification task. We define difficult samples as non-obvious cases for text classification by analysing them in the semantic embedding space; specifically, (i) semantically similar samples that belong to different classes and (ii) semantically dissimilar samples that belong to the same class. We propose a penalty function to measure the overall difficulty score of every sample in the dataset. We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9% and discuss qualitative results to show the effectiveness of our approach in identifying difficult samples for a text classification model.
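The two notions of difficulty in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the paper's penalty function, so the score below is an assumption made purely for illustration: it adds a sample's mean cosine similarity to samples of other classes (case i) to its mean cosine dissimilarity from samples of its own class (case ii). The function name `difficulty_scores`, the equal weighting of the two terms, and the toy 2-D "embeddings" are all hypothetical.

```python
import numpy as np


def difficulty_scores(embeddings, labels):
    """Illustrative difficulty score per sample.

    NOTE: hypothetical stand-in, not the paper's penalty function.
    score = mean cosine similarity to other-class samples
          + mean cosine dissimilarity to same-class samples
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # L2-normalise rows so that dot products are cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T

    # Pairwise masks: same class (excluding the sample itself) and different class.
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)
    diff = y[:, None] != y[None, :]

    # Case (i): similar samples that belong to different classes.
    inter = (sim * diff).sum(axis=1) / np.maximum(diff.sum(axis=1), 1)
    # Case (ii): dissimilar samples that belong to the same class.
    intra = ((1.0 - sim) * same).sum(axis=1) / np.maximum(same.sum(axis=1), 1)

    return inter + intra


# Toy data (assumed): two well-separated classes, plus a fifth sample that
# sits next to class 0 in embedding space but carries the label of class 1.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [1.0, 0.05],
]
toy_labels = [0, 0, 1, 1, 1]

scores = difficulty_scores(toy_embeddings, toy_labels)
hardest = int(np.argmax(scores))
print(scores, hardest)
```

On this toy data the fifth sample receives the highest score, since it is close to the samples of the other class and far from the samples that share its label. In practice the embeddings would come from a sentence encoder, and how the two terms are weighted or combined is a design choice that this sketch does not settle.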
Taxonomy
Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Topic Modeling
