Reddit is all you need: Authorship profiling for Romanian
Ecaterina Ștefănescu, Alexandru-Iulius Jerpelea

TL;DR
This paper introduces a novel Romanian Reddit-based corpus for authorship profiling, demonstrating how LLMs can infer demographic and personal traits from social media texts, and providing a foundation for future NLP research in this area.
Contribution
It creates what the authors describe as the first Romanian social media corpus annotated with author traits, and evaluates LLMs for authorship profiling in Romanian.
Findings
Successfully built a 23k+ sample Romanian Reddit corpus
Demonstrated LLMs can infer demographic traits from social media texts
Released resources publicly for further research
Abstract
Authorship profiling is the process of identifying an author's characteristics based on their writings. This centuries-old problem has become more intriguing, especially with recent developments in Natural Language Processing (NLP). In this paper, we introduce a corpus of short texts in the Romanian language, annotated with certain author characteristic keywords; to our knowledge, the first of its kind. To do this, we exploit a social media platform called Reddit. We leverage its thematic community-based structure (subreddit structure), which offers information about the author's background. We infer a user's demographic and some broad personal traits, such as age category, employment status, interests, and social orientation, based on the subreddit and other cues. We thus obtain a corpus of 23k+ samples, extracted from 100+ Romanian subreddits. We analyse our dataset, and…
Taxonomy
Topics: Authorship Attribution and Profiling
