Tagged Documents Co-Clustering
Gaëlle Candel, David Naccache

TL;DR
This paper introduces a hierarchical co-clustering method for tags associated with various resources, aiming to group related tags into meaningful clusters to improve information retrieval and recommendation systems.
Contribution
It presents a novel hierarchical agglomerative co-clustering algorithm for tags, including preprocessing techniques to handle power-law distributions and an unsupervised stopping criterion.
Findings
Effective on synthetic and real-world datasets
Improves tag grouping for better resource retrieval
Handles power-law tag distributions
Abstract
Tags are short sequences of words used to describe textual and non-textual resources such as music, images, or books. Information retrieval systems can use tags to access a document quickly, and recommender systems can use them to suggest similar items to a user. However, the number of tags per document is limited, and tags are often distributed according to a Zipf law. In this paper, we propose a methodology to cluster tags into conceptual groups. Data are preprocessed to remove power-law effects and enhance the context of low-frequency words. Then, a hierarchical agglomerative co-clustering algorithm is proposed to group the most related tags into clusters. The method's capabilities were evaluated on a sparse synthetic dataset and a real-world tag collection associated with scientific papers. The task being unsupervised, we propose some stopping criterion for…
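To make the idea of grouping tags by shared context concrete, here is a minimal sketch, not the authors' algorithm. It swaps in plain one-sided average-linkage agglomerative clustering for the paper's co-clustering, uses cosine similarity on binary tag-occurrence vectors as a crude way to discount raw tag frequency, and stops on a fixed similarity threshold. The function names, the `min_similarity` parameter, and the toy documents are all assumptions made for illustration; the abstract does not specify the actual preprocessing or stopping criterion.

```python
import math
from itertools import combinations


def tag_documents(docs):
    """Map each tag to the set of document indices it appears in."""
    index = {}
    for i, tags in enumerate(docs):
        for tag in tags:
            index.setdefault(tag, set()).add(i)
    return index


def cosine(docs_a, docs_b):
    """Cosine similarity between two binary tag-occurrence vectors."""
    return len(docs_a & docs_b) / math.sqrt(len(docs_a) * len(docs_b))


def cluster_tags(docs, min_similarity=0.6):
    """Average-linkage agglomerative clustering of tags (illustrative only)."""
    index = tag_documents(docs)
    clusters = [[tag] for tag in sorted(index)]

    def linkage(c1, c2):
        # Mean pairwise cosine similarity between the tags of two clusters.
        sims = [cosine(index[a], index[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i < j by construction).
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:  # hypothetical stopping rule, not the paper's
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data: each inner list holds the tags of one document.
docs = [
    ["jazz", "saxophone"],
    ["jazz", "saxophone", "live"],
    ["python", "numpy"],
    ["python", "numpy", "live"],
]
print(cluster_tags(docs))
```

On this toy input, the tags that always co-occur end up together, while "live", which is split evenly between both contexts, stays on its own. A real co-clustering method would also group the documents, which this sketch does not attempt.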
Taxonomy
Topics: Advanced Clustering Algorithms Research · Complex Network Analysis Techniques · Text and Document Classification Technologies
