Automatic Identification of Subjects for Textual Documents in Digital Libraries

Kuang-hua Chen

arXiv:cs/9902002 · cs.DL · May 23, 2007 · 33 cites

Open Access

TL;DR

This paper presents a model that automatically identifies the subjects of textual documents from four factors (word importance, frequency, co-occurrence, and distance). In preliminary experiments, its performance is close to that of human beings.

Contribution

It introduces a novel model that combines word importance, frequency, co-occurrence, and distance for subject identification, expanding beyond noun-focused methods.

Findings

01

In preliminary experiments, the model's performance is close to that of human beings.

02

Nouns and verbs together contribute to subject identification, so the model considers verbs as well as nouns.

03

The approach rests on the assumption that texts are well-organized and event-driven.

Abstract

The number of electronic documents on the Internet is growing very quickly, so identifying the subjects of documents effectively has become an important issue. In the past, research focused on the behavior of nouns in documents. Although subjects are composed of nouns, the constituents that determine which nouns are subjects are not only nouns. Based on the assumption that texts are well-organized and event-driven, nouns and verbs together contribute to the process of subject identification. This paper considers four factors, 1) word importance, 2) word frequency, 3) word co-occurrence, and 4) word distance, and proposes a model to identify subjects for textual documents. Preliminary experiments show that the performance of the proposed model is close to that of human beings.
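
The abstract names the four factors but does not give the formula that combines them. The following is only a minimal sketch of how such factors could interact, not the paper's model. It assumes part-of-speech tags are already available ("N" for noun, "V" for verb), that word importance is supplied as a hypothetical weight table (defaulting to 1.0), and that distance is measured in token positions. The function name `score_nouns` and the scoring rule are illustrative assumptions.

```python
from collections import defaultdict


def score_nouns(tokens, importance=None):
    """Score each noun as a subject candidate (illustrative only).

    tokens: list of (word, pos) pairs in document order; pos is "N", "V",
            or anything else for words that are ignored.
    importance: optional dict word -> weight (assumed, e.g. IDF-like).
    """
    importance = importance or {}
    # Nouns and verbs are the only words that take part in scoring.
    content = [(i, w) for i, (w, pos) in enumerate(tokens) if pos in ("N", "V")]
    noun_positions = [(i, w) for i, (w, pos) in enumerate(tokens) if pos == "N"]

    scores = defaultdict(float)
    for i, noun in noun_positions:          # frequency: one pass per occurrence
        for j, other in content:            # co-occurrence with nouns and verbs
            if other == noun:
                continue                    # skip the word itself
            weight = importance.get(noun, 1.0) * importance.get(other, 1.0)
            scores[noun] += weight / abs(i - j)   # distance: nearer counts more
    return dict(scores)


# Toy input with hand-assigned tags (hypothetical data, not from the paper).
tokens = [
    ("government", "N"), ("announced", "V"), ("budget", "N"),
    ("critics", "N"), ("attacked", "V"), ("budget", "N"),
    ("government", "N"), ("defended", "V"), ("budget", "N"),
]
scores = score_nouns(tokens)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this sketch a noun gains score each time it occurs, and each occurrence is credited for every other noun or verb in the text, discounted by how far away that word is. Verbs add to the scores of nearby nouns but are never candidates themselves, which mirrors the abstract's point that subjects are nouns while verbs help determine which nouns they are.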

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems