Exploiting Class Labels to Boost Performance on Embedding-based Text Classification

Arkaitz Zubiaga

arXiv:2006.02104·cs.CL·September 3, 2020

TL;DR

This paper introduces TF-CR, a weighting scheme that uses class label information to improve embedding-based text classification, with improved performance reported on eight datasets.

Contribution

The paper proposes TF-CR, a new weighting method that incorporates class distribution information to improve embedding-based text classification accuracy.

Findings

1. TF-CR outperforms TF-IDF and KLD in experiments

2. Improved classification performance on eight datasets

3. Effective use of class label information in weighting scheme

Abstract

Text classification is one of the most frequent tasks in processing textual data, facilitating, among other things, research on large-scale datasets. Embeddings of different kinds have recently become the de facto standard features for text classification. These embeddings can capture the meanings of words as inferred from their occurrences in large external collections. Because they are built from external collections, however, they are unaware of the distributional characteristics of words in the classification dataset at hand, including, most importantly, the distribution of words across classes in the training data. To make the most of these embeddings as features and to boost the performance of classifiers using them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which assigns higher weights to high-frequency, category-exclusive words when computing word embeddings. Our…
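The abstract is cut off before TF-CR is defined, so the following is only an illustrative sketch under an assumption read off the name: the weight of a word w in category c is taken as the product of a term frequency (occurrences of w in c divided by all tokens in c) and a category ratio (occurrences of w in c divided by occurrences of w across all categories). The function name `tf_cr_weights` and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def tf_cr_weights(docs, labels):
    """Return {category: {word: weight}} for tokenized training documents.

    Assumed definition (not confirmed by the truncated abstract):
        weight(w, c) = (count of w in c / tokens in c)
                     * (count of w in c / count of w in all categories)
    """
    # Count word occurrences separately for each category.
    per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        per_class[label].update(tokens)

    # Count word occurrences across the whole training set.
    total = Counter()
    for counts in per_class.values():
        total.update(counts)

    weights = {}
    for label, counts in per_class.items():
        n_c = sum(counts.values())  # total tokens in this category
        weights[label] = {
            word: (k / n_c) * (k / total[word]) for word, k in counts.items()
        }
    return weights


# Toy, hypothetical training data: two categories, four short documents.
docs = [
    ["great", "fun", "movie"],
    ["great", "plot"],
    ["boring", "movie"],
    ["boring", "slow", "plot"],
]
labels = ["pos", "pos", "neg", "neg"]

weights = tf_cr_weights(docs, labels)
```

In this toy data, "great" is both frequent in and exclusive to the "pos" category, so it receives a larger weight there than "movie", which is split evenly between the two categories. Such weights could then scale each word's embedding before averaging a document's vectors, which is one plausible reading of "when computing word embeddings".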
