Using WordNet to Complement Training Information in Text Categorization

Manuel de Buenaga Rodriguez; Jose Maria Gomez Hidalgo; Belen Diaz Agudo

arXiv:cmp-lg/9709007 · cmp-lg · February 3, 2008 · 131 cites

Open Access

TL;DR

This paper explores enhancing text categorization by integrating WordNet lexical data with traditional training methods, leading to improved performance especially for rare categories.

Contribution

It introduces a novel approach combining WordNet with Rocchio and Widrow-Hoff algorithms within the Vector Space Model for better text classification.

Findings

1. Integrating WordNet outperforms the training-only approaches (Rocchio and Widrow-Hoff)
2. Improved classification of low-frequency categories
3. Enhanced overall accuracy in text categorization

Abstract

Automatic Text Categorization (TC) is a complex and useful task for many natural language applications, and is usually performed through the use of a set of manually classified documents, a training collection. We suggest the utilization of additional resources like lexical databases to increase the amount of information that TC systems make use of, and thus, to improve their performance. Our approach integrates WordNet information with two training approaches through the Vector Space Model. The training approaches we test are the Rocchio (relevance feedback) and the Widrow-Hoff (machine learning) algorithms. Results obtained from evaluation show that the integration of WordNet clearly outperforms training approaches, and that an integrated technique can effectively address the classification of low frequency categories.
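
The two training approaches named in the abstract can be illustrated with a minimal sketch in plain Python. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three-term vocabulary, the `prior` vector standing in for weights derived from a lexical database such as WordNet, the way that prior is combined with training (as an additive term in Rocchio and as the starting point for Widrow-Hoff), and the parameter values `beta`, `gamma`, and `eta`.

```python
# Hypothetical vocabulary for a category such as "wheat": ["wheat", "grain", "bank"].
# Each document is a term-weight vector in the Vector Space Model.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rocchio(prior, positives, negatives, beta=16.0, gamma=4.0):
    """Rocchio prototype: prior + beta * mean(pos) - gamma * mean(neg).

    Negative weights are clipped to zero. Adding `prior` is an assumption
    of this sketch, one possible way to inject lexical information.
    """
    n = len(prior)

    def mean(docs):
        if not docs:
            return [0.0] * n
        return [sum(d[j] for d in docs) / len(docs) for j in range(n)]

    pos, neg = mean(positives), mean(negatives)
    return [max(0.0, prior[j] + beta * pos[j] - gamma * neg[j]) for j in range(n)]

def widrow_hoff(prior, examples, eta=0.25):
    """Widrow-Hoff (LMS) rule: w <- w - 2 * eta * (w.x - y) * x per example.

    Training starts from `prior` instead of a zero vector (an assumption).
    """
    w = list(prior)
    for x, y in examples:
        err = dot(w, x) - y
        w = [wj - 2.0 * eta * err * xj for wj, xj in zip(w, x)]
    return w

# Assumed lexical prior: "wheat" and "grain" are related to the category name.
prior = [0.5, 0.5, 0.0]
positives = [[1.0, 0.0, 0.0]]   # one training document assigned to the category
negatives = [[0.0, 0.0, 1.0]]   # one training document outside the category

rocchio_w = rocchio(prior, positives, negatives)
wh_w = widrow_hoff(prior, [(positives[0], 1.0), (negatives[0], 0.0)])

print("Rocchio weights:    ", rocchio_w)
print("Widrow-Hoff weights:", wh_w)
```

In this toy setup the term "grain" never occurs in the training documents, yet it keeps a nonzero weight in both classifiers because of the prior. That is the intuition behind using an external lexical resource for categories with few training examples.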

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Natural Language Processing Techniques