Document Clustering based on Topic Maps
Muhammad Rafi, M. Shahid Shaikh, Amir Farooq

TL;DR
This paper proposes a novel document clustering method using Topic Map representations to better capture semantics, resulting in improved cluster quality over traditional models.
Contribution
It introduces a new semantic representation for documents based on Topic Maps and a similarity measure tailored for this structure, enhancing clustering effectiveness.
Findings
Improved cluster quality demonstrated on standard IR datasets
Semantic representation reduces dimensionality and captures core topics
Method outperforms traditional vector and suffix tree models
Abstract
The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW). The next challenge lies in clustering documents based on their semantic content. The problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs that have a stronger semantic relationship. The feature space of the documents can be very challenging for document clustering. A document may contain multiple topics, it may contain a large set of class-independent…
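To make the two components named in the abstract concrete, here is a minimal sketch. It is not the paper's Topic Map representation or its similarity measure, neither of which is specified in this excerpt. It assumes a hypothetical representation in which each document is a mapping from topic name to weight, and it swaps in a weighted Jaccard score as a stand-in measure that gives higher values to pairs sharing more topic mass. The names `TopicProfile` and `topic_similarity` are illustrative only.

```python
from typing import Dict

# Hypothetical topic-level representation: topic name -> weight
# (for example, how often the topic occurs in the document).
TopicProfile = Dict[str, float]


def topic_similarity(a: TopicProfile, b: TopicProfile) -> float:
    """Weighted Jaccard similarity over topics.

    A stand-in measure for illustration, not the measure proposed in the paper.
    Returns a value in [0, 1]; higher means more shared topic weight.
    """
    topics = set(a) | set(b)
    if not topics:
        return 0.0
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in topics)
    total = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in topics)
    return shared / total if total > 0 else 0.0


# Toy documents (made-up weights, for illustration only).
doc1 = {"clustering": 3.0, "semantics": 2.0}
doc2 = {"clustering": 2.0, "semantics": 1.0, "web": 1.0}
doc3 = {"cooking": 4.0}

print(topic_similarity(doc1, doc2))  # overlapping topics
print(topic_similarity(doc1, doc3))  # no topics in common
```

Working at the topic level rather than the term level keeps each profile small, which is the dimensionality benefit the abstract alludes to; any clustering algorithm that accepts a pairwise similarity could then consume these scores.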
