Knowledge-based Document Classification with Shannon Entropy

AtMa P.O. Chan

arXiv:2206.02363 · cs.CL · June 7, 2022 · 1 citation

TL;DR

This paper introduces a knowledge-based document classification method using Shannon Entropy to measure keyword match diversity, improving robustness and recall without requiring positive samples.

Contribution

It presents a novel entropy-based approach that enhances traditional knowledge-based classifiers by promoting diverse keyword matches and robustness against data distribution changes.

Findings

01 Shannon Entropy improves recall at fixed false positive rates.

02 The method is more robust to data distribution shifts.

03 It performs well even with limited positive training samples.

Abstract

Document classification is the detection of specific content of interest in text documents. In contrast to data-driven machine learning classifiers, knowledge-based classifiers can be constructed from domain-specific knowledge, which usually takes the form of a collection of subject-related keywords. While typical knowledge-based classifiers compute a prediction score based on keyword abundance, they generally suffer from noisy detections due to the lack of a guiding principle for gauging the keyword matches. In this paper, we propose a novel knowledge-based model equipped with Shannon Entropy, which measures the richness of information and favors uniform and diverse keyword matches. Without invoking any positive sample, such a method provides a simple and explainable solution for document classification. We show that the Shannon Entropy significantly improves the recall at fixed…
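To illustrate why entropy favors uniform and diverse keyword matches, here is a minimal Python sketch. It is not the paper's scoring function, which the abstract does not spell out. It assumes single-word keywords, a simple alphanumeric tokenizer, and the textbook definition H = -Σ p_i log2 p_i applied to the normalized keyword match counts. The function name `keyword_entropy` and the example keyword list are hypothetical.

```python
import math
import re
from collections import Counter


def keyword_entropy(text, keywords):
    """Shannon entropy (in bits) of the keyword match distribution in `text`.

    Assumption: keywords are single words and matching is case-insensitive.
    This is an illustrative sketch, not the scoring rule from the paper.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    vocab = {k.lower() for k in keywords}

    # Count how often each keyword is matched in the document.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no keyword matched at all

    # H = sum_i p_i * log2(1 / p_i), with p_i the share of matches for keyword i.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical keyword list for a finance-related topic.
KEYWORDS = ["loan", "credit", "mortgage", "interest"]

# Four matches of one keyword: abundant, but not diverse.
print(keyword_entropy("loan loan loan loan", KEYWORDS))            # 0.0
# Four matches spread evenly over four keywords: maximally diverse.
print(keyword_entropy("loan, credit, mortgage, interest", KEYWORDS))  # 2.0
```

Both example documents contain four keyword matches, so a score based purely on keyword abundance would treat them alike, while the entropy separates the repetitive document (0 bits) from the diverse one (2 bits).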

Taxonomy

Topics: Text and Document Classification Technologies · Machine Learning and Data Classification · Face and Expression Recognition