# Attributing authorship via the perplexity of authorial language models

**Authors:** Weihang Huang, Akira Murakami, Jack Grieve

PMC · DOI: 10.1371/journal.pone.0327081 · PLOS One · 2025-07-03

## TL;DR

This paper presents a new method for determining who wrote a document: fine-tune a separate language model on each candidate author's known writing, then choose the author whose model finds the document most predictable.

## Contribution

A novel authorship attribution technique that scores a questioned document by its perplexity under per-author fine-tuned language models (Authorial Language Models, ALMs).
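
For readers unfamiliar with the metric, the standard definition of perplexity and the resulting decision rule can be written as below. The symbols are assumptions introduced for illustration and do not come from the paper: $Q = (t_1, \dots, t_N)$ is the tokenized questioned document, $p_a$ is the language model fine-tuned on candidate author $a$, and $\mathcal{A}$ is the set of candidates. The exact tokenization and normalization the authors use are not stated in this summary.

$$
\mathrm{PPL}_a(Q) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_a(t_i \mid t_1, \dots, t_{i-1})\right),
\qquad
\hat{a} = \arg\min_{a \in \mathcal{A}} \mathrm{PPL}_a(Q)
$$

Lower perplexity means the model finds the document more predictable, so the predicted author $\hat{a}$ is the one whose model minimizes it.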

## Key findings

- The method meets or exceeds the current state of the art on several standard benchmarking datasets.
- Content words carry more authorship information than function words, challenging existing assumptions.
- The approach allows inspection of linguistic patterns that drive attribution decisions.

## Abstract

Authorship attribution is the task of identifying the most likely author of a questioned document from a set of candidate authors, where each candidate is represented by a writing sample. A wide range of quantitative methods for inferring authorship have been developed in stylometry, but the rise of Large Language Models (LLMs) offers new opportunities in this field. In this paper, we introduce a technique for authorship attribution based on fine-tuned LLMs. Our approach involves first further pretraining LLMs for each candidate author based on their known writings and then assigning the questioned document to the author whose Authorial Language Model (ALM) finds the questioned document most predictable, measured as the perplexity of the questioned document. We find that our approach meets or exceeds the current state-of-the-art on several standard benchmarking datasets. In addition, we show how our approach can be used to measure the predictability of each word in a questioned document for a given candidate ALM, allowing the linguistic patterns that drive our attributions to be inspected directly. Finally, we analyze what types of words generally drive successful attributions, finding that content word classes are characterized by a higher density of authorship information than function word classes, challenging a long-standing assumption of stylometry.
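
The decision rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in place of a further-pretrained LLM, each "authorial model" here is a toy add-one-smoothed unigram model, chosen only so the example runs without external dependencies. The function names (`train_unigram`, `token_surprisals`, `perplexity`, `attribute`) are hypothetical.

```python
import math
from collections import Counter


def train_unigram(text):
    """Toy stand-in for an ALM: unigram counts over an author's known writing.
    (Assumption: the paper uses further-pretrained LLMs, not unigram counts.)"""
    return Counter(text.lower().split())


def token_surprisals(model, text, vocab_size):
    """Return (token, -log p(token)) pairs in nats, using add-one smoothing.
    High values mark tokens the model finds hard to predict."""
    total = sum(model.values())
    pairs = []
    for tok in text.lower().split():
        prob = (model[tok] + 1) / (total + vocab_size)
        pairs.append((tok, -math.log(prob)))
    return pairs


def perplexity(model, text, vocab_size):
    """Exponential of the mean per-token surprisal."""
    pairs = token_surprisals(model, text, vocab_size)
    if not pairs:
        raise ValueError("questioned document has no tokens")
    return math.exp(sum(s for _, s in pairs) / len(pairs))


def attribute(known_writings, questioned):
    """Score the questioned document under each author's model and
    return (author with the lowest perplexity, all scores)."""
    models = {a: train_unigram(t) for a, t in known_writings.items()}
    # Shared vocabulary so the smoothed probabilities are comparable.
    vocab = set(questioned.lower().split())
    for m in models.values():
        vocab.update(m)
    scores = {a: perplexity(m, questioned, len(vocab)) for a, m in models.items()}
    return min(scores, key=scores.get), scores
```

Calling `attribute({"A": "...", "B": "..."}, "questioned text")` returns the predicted author along with every candidate's perplexity. `token_surprisals` corresponds loosely to the per-word predictability inspection mentioned in the abstract: sorting its output shows which tokens a given model found most or least expected.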

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12225838/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12225838/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/PMC12225838/full.md

---
Source: https://tomesphere.com/paper/PMC12225838