Feature Selection Based on Term Frequency and T-Test for Text Categorization
Deqing Wang, Hui Zhang, Rui Liu, Weifeng Lv

TL;DR
This paper introduces a novel feature selection method for text categorization using term frequency and t-test, addressing limitations of existing document frequency-based methods, and demonstrates its competitive performance through extensive experiments.
Contribution
The paper proposes a new feature selection approach based on t-test and term frequency, improving reliability for low-frequency terms and capturing term distribution diversity.
Findings
Macro-F1 and micro-F1 scores comparable to or slightly better than those of state-of-the-art methods
Effective in selecting discriminative features based on term distribution
Shows robustness across three classifiers and two text corpora
Abstract
Much work has been done on feature selection. Existing methods, such as the Chi-Square Statistic and Information Gain, are based on document frequency. However, these methods have two shortcomings: they are not reliable for low-frequency terms, and they only count whether a term occurs in a document, ignoring its term frequency. In fact, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct the feature selection function based on term frequency, and proposes a new approach based on the t-test, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature…
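To make the idea in the abstract concrete, the following is a minimal sketch of a term-frequency-based t-test score. It is not the paper's exact formula, which this summary does not state. The assumptions here are: term frequency is normalized by document length, the category mean is compared against the corpus-wide mean, and the standard error uses the corpus-wide sample variance divided by the category size. The function name t_scores and its parameters are hypothetical.

```python
import math
from collections import Counter


def t_scores(docs, labels, category):
    """Score each term for one category with a t-like statistic.

    ASSUMPTION: this is an illustrative statistic, not the paper's formula.
    score = |mean_in_category - mean_in_corpus| / standard_error
    where the means are taken over per-document relative term frequencies.
    """
    if len(docs) != len(labels):
        raise ValueError("docs and labels must have the same length")

    # Relative term frequency per document: count / document length.
    rel = []
    for tokens in docs:
        counts = Counter(tokens)
        n = len(tokens)
        rel.append({t: c / n for t, c in counts.items()} if n else {})

    in_cat = [r for r, y in zip(rel, labels) if y == category]
    n_all, n_cat = len(rel), len(in_cat)
    if n_cat == 0 or n_all < 2:
        raise ValueError("need at least one in-category doc and two docs overall")

    vocab = set().union(*rel)
    scores = {}
    for term in vocab:
        all_vals = [r.get(term, 0.0) for r in rel]
        mean_all = sum(all_vals) / n_all
        mean_cat = sum(r.get(term, 0.0) for r in in_cat) / n_cat
        # Sample variance of the term's relative frequency over the corpus.
        var = sum((v - mean_all) ** 2 for v in all_vals) / (n_all - 1)
        se = math.sqrt(var / n_cat)
        scores[term] = abs(mean_cat - mean_all) / se if se > 0 else 0.0
    return scores


if __name__ == "__main__":
    # Toy corpus, invented purely for illustration.
    docs = [
        ["goal", "goal", "match", "team"],
        ["goal", "team", "win", "the"],
        ["vote", "election", "the", "party"],
        ["vote", "vote", "party", "the"],
    ]
    labels = ["sport", "sport", "politics", "politics"]
    ranked = sorted(t_scores(docs, labels, "sport").items(),
                    key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        print(f"{term}\t{score:.3f}")
```

A feature selector built on this sketch would keep the top-k terms per category by score. Unlike a document-frequency criterion, the score changes when a term appears several times within a document, which is the property the abstract emphasizes.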
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Web Data Mining and Analysis
