# Clustering articles based on semantic similarity

**Authors:** Shenghui Wang, Rob Koopman

arXiv: 1702.04946 · 2017-02-17

## TL;DR

This paper describes how to build a semantic representation of an article from the entities associated with it, then clusters the articles with K-Means and the Louvain community detection algorithm as a first step toward topic identification.

## Contribution

It details how article-level semantic representations are derived from the entities associated with each article, so that similarities between articles can be calculated and standard clustering algorithms applied.

## Key findings

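- Entity-based semantic representations make similarities between articles computable, which the semantic matrix alone does not support directly
- K-Means and Louvain applied to these representations yield two clustering solutions, OCLC-31 and OCLC-Louvain
- A basic comparison with the other clustering solutions reported in the same special issue is given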

## Abstract

Document clustering is generally the first step for topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents which keep their semantics as much as possible and are also suitable for efficient similarity calculation. The metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. However, this semantic matrix does not allow similarities between articles to be calculated directly. In this paper, we will describe in detail how we build a semantic representation for an article from the entities that are associated with it. Based on such semantic representations of articles, we apply two standard clustering methods, K-Means and the Louvain community detection algorithm, which leads to our two clustering solutions labelled as OCLC-31 (standing for K-Means) and OCLC-Louvain (standing for Louvain). In this paper, we will give the implementation details and a basic comparison with other clustering solutions that are reported in this special issue.
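
The pipeline outlined in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the toy `ENTITY_VECTORS` table stands in for the semantic matrix, an article vector is taken to be the unweighted mean of its entity vectors (the paper may weight entities differently), and clustering uses a small cosine-based variant of Lloyd's K-Means seeded with the first `k` vectors. The Louvain step is not shown.

```python
import math

# Hypothetical toy entity vectors; in the paper these would come from the semantic matrix.
ENTITY_VECTORS = {
    "galaxy":    [1.0, 0.1, 0.0],
    "redshift":  [0.9, 0.2, 0.1],
    "exoplanet": [0.0, 1.0, 0.1],
    "transit":   [0.1, 0.9, 0.0],
}

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def article_vector(entities):
    """Assumed aggregation: unweighted mean of the known entity vectors, unit-normalized."""
    vecs = [ENTITY_VECTORS[e] for e in entities if e in ENTITY_VECTORS]
    if not vecs:
        raise ValueError("article has no known entities")
    dim = len(vecs[0])
    return normalize([sum(v[i] for v in vecs) / len(vecs) for i in range(dim)])

def cosine(a, b):
    """Cosine similarity; a plain dot product because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Cosine-based Lloyd iterations; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each article goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = normalize(
                    [sum(m[i] for m in members) / len(members) for i in range(dim)]
                )
    return labels

# Hypothetical articles, each described only by its associated entities.
articles = [["galaxy", "redshift"], ["exoplanet", "transit"], ["galaxy"], ["transit"]]
vectors = [article_vector(a) for a in articles]
labels = kmeans(vectors, k=2)
print(labels)
```

With article vectors in hand, pairwise similarity reduces to a dot product, which is what makes both centroid-based clustering and a similarity graph for community detection straightforward to build.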

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1702.04946/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1702.04946/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/1702.04946/full.md

---
Source: https://tomesphere.com/paper/1702.04946