Exploiting Class Labels to Boost Performance on Embedding-based Text Classification
Arkaitz Zubiaga

TL;DR
This paper introduces TF-CR (Term Frequency-Category Ratio), a weighting scheme that uses class label information from the training data to weight words in embedding-based text classification, improving performance on eight datasets.
Contribution
The paper proposes TF-CR, a new weighting method that incorporates class distribution information to improve embedding-based text classification accuracy.
Findings
TF-CR outperforms TF-IDF and KLD in experiments
Improved classification performance on eight datasets
Effective use of class label information in weighting scheme
Abstract
Text classification is one of the most frequent tasks for processing textual data, facilitating, among other things, research on large-scale datasets. Embeddings of different kinds have recently become the de facto standard as features used for text classification. These embeddings have the capacity to capture meanings of words inferred from occurrences in large external collections. While they are built out of external collections, they are unaware of the distributional characteristics of words in the classification dataset at hand, including most importantly the distribution of words across classes in training data. To make the most of these embeddings as features and to boost the performance of classifiers using them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings. Our…
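The abstract does not spell out the formula, so the following is only a minimal sketch of one plausible reading of the name: for a word w and category c, a term-frequency factor (occurrences of w in c divided by all tokens in c) multiplied by a category-ratio factor (occurrences of w in c divided by occurrences of w in all categories). The function names `tf_cr_weights` and `doc_vector`, the toy data, the per-category concatenation, and the normalisation by token count are all assumptions made for illustration, not details confirmed by the text above.

```python
from collections import Counter, defaultdict


def tf_cr_weights(docs, labels):
    """Return {category: {word: weight}} from tokenised training documents.

    Assumed formula (not confirmed by the abstract):
        weight(w, c) = (count of w in c / tokens in c) * (count of w in c / count of w overall)
    """
    in_cat = defaultdict(Counter)  # per-category word counts
    overall = Counter()            # word counts across all categories
    for tokens, label in zip(docs, labels):
        in_cat[label].update(tokens)
        overall.update(tokens)

    weights = {}
    for cat, counts in in_cat.items():
        n_cat = sum(counts.values())
        weights[cat] = {
            w: (c / n_cat) * (c / overall[w]) for w, c in counts.items()
        }
    return weights


def doc_vector(tokens, weights, embeddings, dim):
    """Concatenate one weighted embedding average per category (assumed design)."""
    known = [t for t in tokens if t in embeddings]
    vector = []
    for cat in sorted(weights):
        acc = [0.0] * dim
        for t in known:
            w = weights[cat].get(t, 0.0)  # unseen in this category: weight 0
            for i, value in enumerate(embeddings[t]):
                acc[i] += w * value
        # Normalise by the number of in-vocabulary tokens (an illustrative choice).
        n = max(len(known), 1)
        vector.extend(a / n for a in acc)
    return vector


# Hypothetical toy data: two tiny training documents and 2-d embeddings.
train_docs = [["good", "great", "film"], ["bad", "awful", "film"]]
train_labels = ["pos", "neg"]
toy_embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "film": [1.0, 1.0]}

weights = tf_cr_weights(train_docs, train_labels)
features = doc_vector(["good", "film"], weights, toy_embeddings, dim=2)
```

Under this reading, a word such as "good" that occurs only in one category receives a higher weight there than "film", which is spread evenly across both, matching the abstract's description of weighting high-frequency, category-exclusive words higher.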
