A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces

Rafi Trad; Myra Spiliopoulou

arXiv:2011.15038·cs.CL·December 1, 2020

TL;DR

This paper introduces a framework for authorial clustering of short texts using a latent semantic space derived from non-parametric topic modeling, achieving significant dimensionality reduction and improved clustering performance.

Contribution

The authors propose a novel high-level framework utilizing a compact latent feature space for authorial clustering of short texts, incorporating both unsupervised and semi-supervised scenarios with constraint-based information.

Findings

01 Latent feature space reduces dimensionality by a factor of 1500.

02 Semi-supervised constraints improve clustering performance.

03 The framework performs well across multiple languages and genres.

Abstract

Authorial clustering involves the grouping of documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and suffer from serious consequences like the curse of dimensionality, while feature selection may lead to information loss. We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are identified thereafter in two scenarios: (a) fully unsupervised and (b) semi-supervised where a small number of shorter texts are known to belong to the same author…
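
The two scenarios in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-model step is replaced by hand-written toy topic proportions (`doc_topics`, hypothetical values), and the way constraints are handled here (merging must-link pairs first, then average-linkage agglomerative clustering on cosine distance) is an assumption chosen for brevity. The function name `cluster_with_must_links` and its parameters are likewise illustrative.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_with_must_links(vectors, n_clusters, must_link=()):
    """Average-linkage agglomerative clustering seeded with must-link pairs.

    With an empty `must_link` this is the fully unsupervised scenario;
    with pairs supplied it is a simple semi-supervised variant (assumed
    here, not taken from the paper).
    """
    # Union-find: texts known to share an author start in the same cluster.
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    clusters = list(groups.values())

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy latent representation: one row per short text, one column per topic.
# These numbers are made up; in the framework they would come from a
# non-parametric topic model.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.40, 0.55, 0.05],  # ambiguous text
]

unsupervised = cluster_with_must_links(doc_topics, n_clusters=2)
constrained = cluster_with_must_links(doc_topics, n_clusters=2, must_link=[(0, 4)])
```

In this toy setup the ambiguous fifth text falls in with texts 2 and 3 when no constraints are given, and moves to the cluster of text 0 once the single must-link pair `(0, 4)` is supplied.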

Code & Models

Repositories

rtrad89/authorship_clustering_code_repo
Official

Taxonomy

Methods: Feature Selection