TL;DR
GerAV is a large German authorship verification (AV) benchmark built from diverse data sources. It enables systematic evaluation of models and shows that fine-tuned LLMs outperform existing baselines and GPT-5 on this task.
Contribution
This paper presents GerAV, a comprehensive new benchmark for German AV with over 400k labeled text pairs, and evaluates models on it, showing that fine-tuned LLMs achieve state-of-the-art results.
Findings
Fine-tuned LLMs outperform recent baselines by up to 0.09 in F1 score.
Models trained on a specific data type perform best when evaluated under matching conditions.
Combining training sources improves model generalization across data regimes.
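To make the reported metric concrete, here is a minimal sketch of how an F1 score is computed over same-author / different-author predictions. It assumes binary F1 on the "same author" class; the summary does not say which F1 variant the paper reports. The function name and the toy labels are made up for illustration and are not taken from GerAV.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (1 = same author, 0 = different authors)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # same author, predicted same
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # different authors, predicted same
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # same author, predicted different
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, NOT results from the paper): one entry per text pair.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
system_a = [1, 0, 0, 0, 1, 0, 1, 0]
system_b = [1, 1, 0, 0, 1, 0, 1, 0]

gap = f1_score(gold, system_b) - f1_score(gold, system_a)
print(f"F1 gap between the two toy systems: {gap:.3f}")
```

A gap such as the reported 0.09 is a difference between two such scores computed on the same test pairs.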
Abstract
Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 400k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09…
