FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization

Bo Tang; Haibo He

arXiv:1606.06366·stat.ML·June 22, 2016·2 cites

TL;DR

This paper introduces FSMJ, a wrapper feature selection method for text categorization that greedily maximizes Jensen-Shannon divergence. It operates on real-valued features rather than binary-valued ones, and in the authors' experiments on real-life data sets it outperforms state-of-the-art feature selection methods.

Contribution

The paper proposes FSMJ, a greedy feature selection approach that maximizes JS-divergence over real-valued features, and reports superior performance over state-of-the-art feature selection methods for text categorization.

Findings

01

FSMJ outperforms state-of-the-art feature selection methods for text categorization in experiments on real-life data sets.

02

JS-divergence increases monotonically as more features are selected.

03

Real-valued features provide more information for discrimination than the binary-valued features used in conventional approaches.

Abstract

In this paper, we present a new wrapper feature selection approach for text categorization based on Jensen-Shannon (JS) divergence, termed feature selection with maximum JS-divergence (FSMJ). Unlike most existing feature selection approaches, FSMJ is based on real-valued features, which provide more information for discrimination than the binary-valued features used in conventional approaches. We show that FSMJ is a greedy approach and that the JS-divergence increases monotonically as more features are selected. We conduct several experiments on real-life data sets and compare FSMJ with state-of-the-art feature selection approaches for text categorization. The superior performance of the proposed FSMJ approach demonstrates its effectiveness and further indicates its wide potential for applications in data mining.
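The greedy selection and the monotonicity property mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes only two classes, equal class weights, and class-conditional term distributions estimated from raw term-frequency counts (a simple stand-in for real-valued features). The function names `js_term_contributions` and `greedy_select` are hypothetical. Under these assumptions the JS-divergence decomposes into one non-negative contribution per term, so picking terms in order of decreasing contribution is greedy and the accumulated divergence can never decrease.

```python
import numpy as np


def js_term_contributions(p, q):
    """Per-term contributions to the equal-weight JS-divergence (in nats).

    p and q are discrete distributions over the same vocabulary.
    The contributions sum to JS(p, q), and each one is non-negative.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def part(a):
        # a * log(a / m), with the convention 0 * log(0) = 0
        out = np.zeros_like(a)
        mask = a > 0
        out[mask] = a[mask] * np.log(a[mask] / m[mask])
        return out

    return 0.5 * part(p) + 0.5 * part(q)


def greedy_select(counts_a, counts_b, k):
    """Pick the k terms with the largest JS contributions (illustrative only).

    counts_a and counts_b are per-term frequency totals for two classes.
    Returns the selected term indices and the cumulative divergence
    after each selection step.
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    p = counts_a / counts_a.sum()
    q = counts_b / counts_b.sum()
    contrib = js_term_contributions(p, q)
    order = np.argsort(-contrib, kind="stable")[:k]
    return order, np.cumsum(contrib[order])
```

Calling `greedy_select(class_a_counts, class_b_counts, k)` with two count vectors of equal length returns the chosen term indices together with a running total that is non-decreasing in `k`. The paper's own formulation may differ from this simplification, for example in how class priors and more than two classes are handled.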

Taxonomy

Topics: Text and Document Classification Technologies · Face and Expression Recognition · Spam and Phishing Detection