TL;DR
This paper introduces a framework for authorial clustering of short texts that represents documents in a latent semantic space derived from non-parametric topic modeling, reducing dimensionality by a factor of 1500 and improving clustering performance.
Contribution
The authors propose a high-level framework that clusters short texts by author in a compact latent feature space. It covers two scenarios: a fully unsupervised one, and a semi-supervised one in which constraints encode that a small number of texts are known to share an author.
Findings
The latent feature space reduces dimensionality by a factor of 1500.
Semi-supervised constraints improve clustering performance.
The framework performs well across multiple languages and genres.
Abstract
Authorial clustering involves the grouping of documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and suffer from serious consequences like the curse of dimensionality, while feature selection may lead to information loss. We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are identified thereafter in two scenarios: (a) fully unsupervised and (b) semi-supervised where a small number of shorter texts are known to belong to the same author…
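The two scenarios in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name a clustering algorithm, so average-linkage agglomerative clustering over cosine distance is swapped in here as an assumption. The topic-proportion vectors are invented toy data standing in for the latent feature space, and the names `cluster`, `merge_must_links`, and `must_link` are hypothetical. The semi-supervised case is modeled by merging documents known to share an author before clustering starts.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def merge_must_links(n_docs, must_link):
    """Group document indices connected by must-link pairs (union-find)."""
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_docs):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def cluster(vectors, n_clusters, must_link=()):
    """Average-linkage agglomerative clustering (an assumed stand-in).

    With must_link empty this is fully unsupervised; otherwise the
    constrained documents start out in the same cluster.
    """
    clusters = merge_must_links(len(vectors), must_link)

    def avg_dist(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy latent representation: 5 short texts over 3 hypothetical topics.
docs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],  # ambiguous text, leans slightly toward topic 0
]

unsupervised = cluster(docs, n_clusters=2)
semi_supervised = cluster(docs, n_clusters=2, must_link=[(4, 2)])
print(unsupervised)     # [[0, 1, 4], [2, 3]]
print(semi_supervised)  # [[0, 1], [2, 3, 4]]
```

In this toy run, the ambiguous text 4 joins texts 0 and 1 when no constraints are given, while a single must-link pair between texts 4 and 2 moves it to the other cluster. This only illustrates how a small amount of same-author information can change the grouping in a compact space; it makes no claim about the paper's reported results.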
Taxonomy
Methods: Feature Selection
