Clustering scientific publications: lessons learned through experiments with a real citation network
Vu Thi Huong, Thorsten Koch

TL;DR
This paper evaluates graph-based clustering algorithms on a large citation network, revealing that default settings often underperform and emphasizing the importance of parameter tuning for meaningful results.
Contribution
It provides practical insights into the performance and tuning of spectral, Louvain, and Leiden clustering methods on large, real-world citation networks.
Findings
Scalable methods like Louvain and Leiden are efficient but require careful parameter tuning.
Default settings often lead to poor clustering quality in large, complex networks.
Effective clustering depends on understanding the specific structure of the citation graph.
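One way to see why default settings can mislead is to look at the objective that Louvain optimizes, and that Leiden can optimize: modularity with a resolution parameter. The sketch below is an illustration only, not the authors' experimental setup. The toy graph (two 4-node cliques joined by one edge), the function name `modularity`, and the choice of resolution as the parameter being tuned are all assumptions made for this example. It scores two candidate partitions of the same graph at two resolution values.

```python
from collections import defaultdict
from itertools import combinations


def modularity(edges, partition, resolution=1.0):
    """Modularity of an undirected, unweighted graph for a given partition.

    Q = sum over clusters c of  L_c / m - resolution * (d_c / (2 m))^2
    where L_c is the number of edges inside c, d_c is the sum of degrees
    of the nodes in c, and m is the total number of edges.
    """
    m = len(edges)
    internal = defaultdict(int)      # edges with both endpoints in one cluster
    total_degree = defaultdict(int)  # summed node degree per cluster
    for u, v in edges:
        total_degree[partition[u]] += 1
        total_degree[partition[v]] += 1
        if partition[u] == partition[v]:
            internal[partition[u]] += 1
    return sum(
        internal[c] / m - resolution * (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Hypothetical toy graph: two 4-node cliques joined by a single bridge edge.
clique_a = list(combinations(range(0, 4), 2))
clique_b = list(combinations(range(4, 8), 2))
edges = clique_a + clique_b + [(3, 4)]

# Two candidate partitions: everything in one cluster vs. one cluster per clique.
coarse = {node: 0 for node in range(8)}
fine = {node: (0 if node < 4 else 1) for node in range(8)}

for gamma in (0.1, 1.0):
    q_coarse = modularity(edges, coarse, resolution=gamma)
    q_fine = modularity(edges, fine, resolution=gamma)
    best = "fine" if q_fine > q_coarse else "coarse"
    print(f"resolution={gamma}: coarse={q_coarse:.3f} fine={q_fine:.3f} best={best}")
```

On this toy graph the higher-scoring partition flips between the two resolution values, so the same algorithm can return very different clusterings depending on a single setting. On a real citation graph, a sweep over such parameters would be run with a library implementation of Louvain or Leiden rather than hand-built partitions.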
Abstract
Clustering scientific publications can reveal underlying research structures within bibliographic databases. Graph-based clustering methods, such as spectral, Louvain, and Leiden algorithms, are frequently used because they model citation networks effectively. However, their performance may degrade when applied to real-world data. This study evaluates the performance of these clustering algorithms on a citation graph comprising approximately 700,000 papers and 4.6 million citations extracted from Web of Science. The results show that while scalable methods like Louvain and Leiden perform efficiently, their default settings often yield poor partitioning. Meaningful outcomes require careful parameter tuning, especially for large networks with uneven structures, including a dense core and loosely connected papers. These findings highlight practical lessons about the challenges of…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Semantic Web and Ontologies
