Same or Different? Diff-Vectors for Authorship Analysis
Silvia Corbara, Alejandro Moreo, Fabrizio Sebastiani

TL;DR
This paper studies Diff-Vectors, a representation for authorship analysis in which each vector encodes a pair of documents rather than a single document, and reports improved identification performance, especially when training data are limited.
Contribution
It systematically studies Diff-Vectors for authorship tasks, demonstrating their advantages over traditional feature vectors and proposing new methods for verification and attribution.
Findings
Diff-Vectors improve authorship identification accuracy.
Diff-Vectors are especially effective with scarce training data.
The paper proposes new Diff-Vector-based methods for authorship verification and attribution.
Abstract
We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has…
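The pair-based representation described in the abstract can be illustrated with a minimal sketch. It assumes that features are word tokens drawn from a tiny, made-up vocabulary; the feature set actually used in the paper is not stated in this summary. The function names (`relative_frequencies`, `diff_vector`, `build_pairs`) and the toy documents are hypothetical and serve only to show how each feature value becomes the absolute difference of two relative frequencies, with a binary same-author label.

```python
from collections import Counter
from itertools import combinations


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary feature in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[f] / total if total else 0.0 for f in vocabulary]


def diff_vector(freqs_a, freqs_b):
    """Absolute per-feature difference; symmetric, so the pair is unordered."""
    return [abs(a - b) for a, b in zip(freqs_a, freqs_b)]


def build_pairs(documents, authors, vocabulary):
    """Turn labelled documents into (diff-vector, same-author label) examples."""
    vectors = [relative_frequencies(doc, vocabulary) for doc in documents]
    X, y = [], []
    # Each unordered pair of distinct documents yields one training example.
    for i, j in combinations(range(len(documents)), 2):
        X.append(diff_vector(vectors[i], vectors[j]))
        y.append(1 if authors[i] == authors[j] else 0)  # 1 = same author
    return X, y


# Toy data (hypothetical): three short documents by two authors.
vocabulary = ["the", "of", "and"]
documents = [
    "the cat and the dog".split(),
    "the end of the road".split(),
    "of mice and men".split(),
]
authors = ["A", "A", "B"]

X, y = build_pairs(documents, authors, vocabulary)
```

The resulting `X` and `y` could be passed to any binary classifier, which is consistent with the abstract's description of the representation as learner-independent. Note that n documents produce n(n-1)/2 pairs, so even a small labelled collection yields many training examples.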
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection
