Tagged Documents Co-Clustering

Gaëlle Candel; David Naccache

arXiv:2110.11079·cs.IR·October 22, 2021

TL;DR

This paper introduces a hierarchical co-clustering method for tags associated with various resources, aiming to group related tags into meaningful clusters to improve information retrieval and recommendation systems.

Contribution

It presents a novel hierarchical agglomerative co-clustering algorithm for tags, including preprocessing techniques to handle power-law distributions and an unsupervised stopping criterion.

Findings

01 Effective on synthetic and real-world datasets

02 Improves tag grouping for better resource retrieval

03 Handles power-law tag distributions

Abstract

Tags are short sequences of words used to describe textual and non-textual resources such as music, images, or books. Machine information retrieval systems can use tags to access a document quickly. Tags can also be used to build recommender systems that suggest similar items to a user. However, the number of tags per document is limited, and tags are often distributed according to a Zipf law. In this paper, we propose a methodology to cluster tags into conceptual groups. Data are preprocessed to remove power-law effects and enhance the context of low-frequency words. Then, a hierarchical agglomerative co-clustering algorithm is proposed to group together the most related tags into clusters. The capabilities were evaluated on a sparse synthetic dataset and a real-world tag collection associated with scientific papers. The task being unsupervised, we propose some stopping criterion for…
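The pipeline outlined in the abstract (reweight the data to dampen frequent tags, then merge the most related tags bottom-up until a stopping rule fires) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the preprocessing formula, the similarity measure, or the stopping criterion, so the sketch assumes an IDF-style weight, cosine similarity between summed cluster vectors, and a fixed similarity threshold. It also clusters tags only, rather than co-clustering tags and documents. The function names `idf_weighted_vectors`, `cosine`, and `agglomerate` are hypothetical.

```python
import math
from collections import defaultdict


def idf_weighted_vectors(docs):
    """Map each tag to a sparse {doc_id: weight} vector.

    Assumption: an IDF-style weight stands in for the paper's
    (unspecified here) power-law preprocessing, so rare tags weigh more.
    """
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for tags in docs:
        for tag in set(tags):
            doc_freq[tag] += 1
    vectors = defaultdict(dict)
    for doc_id, tags in enumerate(docs):
        for tag in set(tags):
            vectors[tag][doc_id] = math.log(1 + n_docs / doc_freq[tag])
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerate(vectors, threshold=0.5):
    """Greedily merge the two most similar clusters.

    Assumption: merging stops when the best similarity falls below a
    fixed threshold; the paper proposes its own unsupervised criterion.
    """
    clusters = [({tag}, dict(vec)) for tag, vec in vectors.items()]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        merged_tags = clusters[i][0] | clusters[j][0]
        merged_vec = dict(clusters[i][1])
        for k, w in clusters[j][1].items():
            merged_vec[k] = merged_vec.get(k, 0.0) + w
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((merged_tags, merged_vec))
    return [tags for tags, _ in clusters]


# Toy data: each inner list holds the tags of one document.
docs = [
    ["jazz", "saxophone"],
    ["jazz", "saxophone"],
    ["python", "compiler"],
    ["python", "compiler"],
]
tag_clusters = agglomerate(idf_weighted_vectors(docs))
```

On this toy input the tags that always co-occur end up in the same cluster, while the two unrelated groups stay apart because their vectors share no documents. The pairwise search is quadratic per merge, which is acceptable for illustration but not for a large tag vocabulary.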

Taxonomy

Topics: Advanced Clustering Algorithms Research · Complex Network Analysis Techniques · Text and Document Classification Technologies