Understanding writing style in social media with a supervised contrastively pre-trained transformer
Javier Huertas-Tato, Alejandro Martin, David Camacho

TL;DR
This paper introduces STAR, a supervised contrastively pre-trained transformer model trained on a large social media corpus to improve authorship attribution and understanding of online harmful behaviors.
Contribution
We propose STAR, a novel author representation model trained on 4.5 million texts using supervised contrastive loss, achieving state-of-the-art zero-shot attribution and clustering performance.
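As a rough illustration of what a supervised contrastive objective does with author labels, here is a minimal pure-Python sketch. It follows the widely used SupCon formulation, where texts by the same author are pulled together and all other texts in the batch are pushed apart. The function names, the temperature value, and the toy two-dimensional embeddings are assumptions made for illustration. The summary above does not give STAR's exact loss variant or hyperparameters, so this should not be read as the authors' implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, author_ids, temperature=0.1):
    """Supervised contrastive loss over one batch of text embeddings.

    embeddings: list of equal-length float vectors, one per text.
    author_ids: list of labels; texts sharing a label are positives.
    temperature: softmax temperature (0.1 is an illustrative assumption).
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and author_ids[j] == author_ids[i]]
        if not positives:
            continue  # an anchor with no same-author text contributes nothing
        # Scaled cosine similarity of the anchor to every other text.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(sims.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims.values()))
        # Average negative log-probability of picking a same-author text.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: the first two vectors are close, as are the last two.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supcon_loss(toy_embeddings, ["a", "a", "b", "b"])
mismatched_loss = supcon_loss(toy_embeddings, ["a", "b", "a", "b"])
```

When the labels agree with the geometry of the embeddings, the loss is close to zero; when same-author texts sit far apart, it is much larger, which is the signal that drives the encoder to cluster texts by author.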
Findings
Zero-shot attribution and clustering performance on PAN challenges
80% accuracy in author identification among 1616 authors
Effective authorship verification with a simple dense layer
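One common way to perform zero-shot attribution with a pre-trained author encoder is to compare a query embedding against a few known texts per candidate author, without any further training. The sketch below assumes that setup: it averages each author's support embeddings into a centroid and picks the author with the highest cosine similarity. The helper names, the centroid strategy, and the toy vectors are hypothetical. In practice the embeddings would come from the encoder, and the paper's own evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute(query, support):
    """Return (author, score) for the author whose centroid is closest.

    query: embedding of the text to attribute.
    support: dict mapping author name -> list of embeddings of known texts.
    """
    best_author, best_score = None, -2.0  # cosine is always >= -1
    for author, vectors in support.items():
        dim = len(vectors[0])
        # Mean of the author's support embeddings.
        centroid = [sum(v[k] for v in vectors) / len(vectors)
                    for k in range(dim)]
        score = cosine(query, centroid)
        if score > best_score:
            best_author, best_score = author, score
    return best_author, best_score


# Toy support set with two hypothetical authors.
toy_support = {
    "alice": [[1.0, 0.0], [0.9, 0.1]],
    "bob": [[0.0, 1.0], [0.1, 0.9]],
}
predicted_author, similarity = attribute([0.8, 0.2], toy_support)
```

Because no classifier is trained on the candidate authors, the same procedure applies unchanged to authors never seen during pre-training, which is what makes the setting zero-shot.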
Abstract
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation. Malicious actors now have unprecedented freedom to misbehave, leading to severe societal unrest and dire consequences, as exemplified by events such as the Capitol assault during the US presidential election and the Antivaxx movement during the COVID-19 pandemic. Understanding online language has become more pressing than ever. While existing works predominantly focus on content analysis, we aim to shift the focus towards understanding harmful behaviors by relating content to their respective authors. Numerous novel approaches attempt to learn the stylistic features of authors in texts, but many of these approaches are constrained by small datasets or sub-optimal training losses. To overcome these limitations, we introduce the Style Transformer for…
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Softmax · Residual Connection · Absolute Position Encodings · Layer Normalization · Adam · Byte Pair Encoding
