T5 meets Tybalt: Author Attribution in Early Modern English Drama Using Large Language Models

Rebecca M. M. Hicke; David Mimno

arXiv:2310.18454·cs.CL·October 31, 2023·1 cites

Open Access

TL;DR

This paper explores the use of large language models, especially fine-tuned T5, for author attribution in Early Modern English drama, revealing both high accuracy on short texts and potential biases from training data.

Contribution

It demonstrates the effectiveness of fine-tuned T5 models for stylometry and highlights challenges related to training data influence on predictions.

Findings

01. Fine-tuned T5 outperforms traditional baselines in author attribution.

02. LLMs can accurately identify authors from very short passages.

03. Pre-training data influences attribution results, raising bias concerns.

Abstract

Large language models have shown breakthrough potential in many NLP domains. Here we consider their use for stylometry, specifically authorship identification in Early Modern English drama. We find both promising and concerning results; LLMs are able to accurately predict the author of surprisingly short passages but are also prone to confidently misattribute texts to specific authors. A fine-tuned t5-large model outperforms all tested baselines, including logistic regression, SVM with a linear kernel, and cosine delta, at attributing small passages. However, we see indications that the presence of certain authors in the model's pre-training data affects predictive results in ways that are difficult to assess.
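
For readers unfamiliar with the cosine delta baseline named in the abstract, the following is a minimal sketch of the general technique: z-score the relative frequencies of the most frequent words, then attribute a passage to the author whose profile is closest in cosine distance. It is not the authors' implementation. The function names, the whitespace tokenizer, the `n_words` parameter, and the toy passages are all assumptions made for illustration.

```python
import math
from collections import Counter


def relative_freqs(text, vocab):
    """Relative frequency of each vocab word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cosine_delta_attribute(train, query, n_words=100):
    """Attribute `query` to the nearest author profile (hypothetical helper).

    train: dict mapping author name -> list of training texts.
    Returns (predicted_author, {author: distance}).
    """
    docs = [(author, t) for author, texts in train.items() for t in texts]

    # Vocabulary: the n_words most frequent words across the training texts.
    corpus_counts = Counter()
    for _, t in docs:
        corpus_counts.update(t.lower().split())
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    # Per-word mean and standard deviation over the training documents.
    freqs = [relative_freqs(t, vocab) for _, t in docs]
    n = len(freqs)
    columns = list(zip(*freqs))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0  # avoid divide-by-zero
        for col, m in zip(columns, means)
    ]

    def zscore(vec):
        return [(x - m) / s for x, m, s in zip(vec, means, stds)]

    # Author profile: mean z-score vector over that author's documents.
    by_author = {}
    for (author, _), f in zip(docs, freqs):
        by_author.setdefault(author, []).append(zscore(f))
    profiles = {
        author: [sum(col) / len(col) for col in zip(*vecs)]
        for author, vecs in by_author.items()
    }

    q = zscore(relative_freqs(query, vocab))
    distances = {a: cosine_distance(q, p) for a, p in profiles.items()}
    return min(distances, key=distances.get), distances


# Toy, made-up passages (not from the paper's corpus).
train = {
    "A": ["thou art the sun and thou art the moon", "thou dost know the way thou dost"],
    "B": ["you are the sun and you are the moon", "you do know the way you do"],
}
query = "thou art the star and thou art kind"
author, distances = cosine_delta_attribute(train, query, n_words=3)
print(author, distances)
```

In practice, delta-style methods are applied to much longer texts and a vocabulary of hundreds of function words; the three-word vocabulary here only keeps the toy example easy to follow.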

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Support Vector Machine