Document Clustering based on Topic Maps

Muhammad Rafi; M. Shahid Shaikh; Amir Farooq

arXiv:1112.6219·cs.IR·December 30, 2011

TL;DR

This paper proposes a novel document clustering method using Topic Map representations to better capture semantics, resulting in improved cluster quality over traditional models.

Contribution

It introduces a new semantic representation for documents based on Topic Maps and a similarity measure tailored for this structure, enhancing clustering effectiveness.

Findings

01 Improved cluster quality demonstrated on standard IR datasets

02 Semantic representation reduces dimensionality and captures core topics

03 Method outperforms traditional vector and suffix tree models

Abstract

The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW). The next challenge lies in clustering documents based on their semantic content. The problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help reduce the dimensionality of the document, and (2) defining a similarity measure over that semantic representation that assigns higher numerical values to document pairs with a stronger semantic relationship. The feature space of documents can be very challenging for document clustering. A document may contain multiple topics; it may contain a large set of class-independent…
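The two components named in the abstract can be illustrated with a minimal sketch. It is not the paper's Topic Map representation or its similarity measure, neither of which is specified in this excerpt. It assumes each document has already been reduced to a small set of topic labels (the `docs` dictionary is hypothetical data) and swaps in plain Jaccard overlap as a stand-in similarity, so that pairs sharing more topics receive higher scores.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two topic sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(docs: dict) -> dict:
    """Score every unordered pair of documents by topic-set overlap."""
    return {
        (p, q): jaccard(docs[p], docs[q])
        for p, q in combinations(sorted(docs), 2)
    }


# Hypothetical documents, each represented by a handful of topic labels
# instead of a high-dimensional term vector (assumption, not from the paper).
docs = {
    "d1": {"clustering", "semantics", "topic maps"},
    "d2": {"clustering", "topic maps", "similarity"},
    "d3": {"football", "league", "transfer"},
}

sims = pairwise_similarity(docs)
for pair, score in sims.items():
    print(pair, round(score, 2))
```

Here `d1` and `d2` share two of four distinct topics, so they score higher than either does against `d3`. Any clustering algorithm that accepts a pairwise similarity matrix could consume scores of this kind.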
