Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Bo Tang, Steven Kay, and Haibo He

arXiv:1602.02850 · stat.ML · November 15, 2016

TL;DR

This paper introduces an information-theoretic feature selection framework for text categorization, comprising a new divergence measure and two feature selection methods intended to improve classifier efficiency and effectiveness.

Contribution

The paper proposes a new divergence measure, the Jeffreys-Multi-Hypothesis (JMH) divergence, and two feature selection methods, maximum discrimination (MD) and MD-χ², for ranking features in multi-class text classification.

Findings

01

The proposed feature selection methods outperform existing approaches.

02

JMH divergence accurately measures multi-distribution divergence.

03

Experimental results show improved classification performance.

Abstract

Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank the features by their discriminative capacity for classification. We first revisit two information measures, Kullback-Leibler divergence and Jeffreys divergence, for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination (MD) and MD-χ² methods, for text…
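The two measures the abstract revisits have standard definitions for discrete distributions: the Kullback-Leibler divergence is KL(P‖Q) = Σ p_i log(p_i / q_i), and the Jeffreys divergence is its symmetrized form, J(P, Q) = KL(P‖Q) + KL(Q‖P). The sketch below is a minimal illustration of those textbook definitions only. It is not the paper's JMH divergence or its MD methods, and the function names and example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute zero; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def jeffreys_divergence(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical example: the distribution of one binary term feature
# (absent, present) under two classes.
p_class_a = [0.5, 0.5]
p_class_b = [0.9, 0.1]

print(kl_divergence(p_class_a, p_class_b))
print(kl_divergence(p_class_b, p_class_a))
print(jeffreys_divergence(p_class_a, p_class_b))
```

KL divergence is asymmetric, so the first two printed values differ, while the Jeffreys divergence is the same regardless of argument order.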
