Writer Identification Using Microblogging Texts for Social Media Forensics

Fernando Alonso-Fernandez; Nicole Mariah Sharon Belvisi; Kevin Hernandez-Diaz; Naveed Muhammad; Josef Bigun

arXiv:2008.01533·cs.CL·November 29, 2021

TL;DR

This paper investigates authorship identification of Twitter messages using stylometric and platform-specific features, demonstrating high accuracy with sufficient training data and offering insights into feature effectiveness and computational aspects.

Contribution

It introduces a comprehensive evaluation of stylometric and Twitter-specific features for authorship attribution on short texts, with automatic feature selection and analysis of performance across different data sizes.

Findings

01

Rank-5 accuracy above 80% with more than 500 training Tweets and only a few dozen test Tweets per author, even with several thousand authors.

02

Candidate search space reduced by 9-15% even with small training samples (10-20 training Tweets).

03

Verification error rate below 15% with hundreds of training Tweets.

Abstract

Establishing authorship of online texts is fundamental to combating cybercrime. Unfortunately, text length is limited on some platforms, making the challenge harder. We aim to identify the authorship of Twitter messages limited to 140 characters. We evaluate popular stylometric features, widely used in literary analysis, and Twitter-specific features such as URLs, hashtags, replies, or quotes. We use two databases with 93 and 3957 authors, respectively. We test varying-sized author sets and varying amounts of training/test texts per author. Performance is further improved by feature combination via automatic selection. With a large number of training Tweets (>500), good accuracy (Rank-5 > 80%) is achievable with only a few dozen test Tweets, even with several thousand authors. With smaller sample sizes (10-20 training Tweets), the search space can be diminished by 9-15% while…
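
The Rank-5 figure quoted in the abstract is an identification metric: a test query counts as correct when its true author appears among the five best-ranked candidates. The following is a minimal sketch of that computation, not the authors' code. The function name `rank_k_accuracy`, the score-matrix layout, and the convention that higher scores mean greater similarity are all assumptions made for illustration; the abstract does not state which similarity or distance measure the paper uses.

```python
from typing import Sequence


def rank_k_accuracy(
    scores: Sequence[Sequence[float]],
    true_idx: Sequence[int],
    k: int = 5,
) -> float:
    """Fraction of queries whose true author is among the k top-scoring candidates.

    scores[i][j] is an assumed similarity between test query i and candidate
    author j (higher means more similar). true_idx[i] is the index of the
    true author of query i.
    """
    if len(scores) != len(true_idx) or len(true_idx) == 0:
        raise ValueError("scores and true_idx must be non-empty and equally long")

    hits = 0
    for row, truth in zip(scores, true_idx):
        # Candidate indices ordered from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_idx)


# Toy example with three queries and three candidate authors (made-up scores).
toy_scores = [
    [0.9, 0.1, 0.5],
    [0.2, 0.3, 0.8],
    [0.4, 0.6, 0.5],
]
toy_truth = [0, 1, 0]
rank1 = rank_k_accuracy(toy_scores, toy_truth, k=1)
rank2 = rank_k_accuracy(toy_scores, toy_truth, k=2)
rank3 = rank_k_accuracy(toy_scores, toy_truth, k=3)
```

If the matching step produces distances instead of similarities, the same sketch applies with `reverse=False` in the sort.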
