Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship

Alon Kipnis

arXiv:1911.01208 · cs.CL · October 3, 2023 · 5 cites

TL;DR

This paper adapts the Higher Criticism test to measure similarity between word-frequency tables, reaching state-of-the-art accuracy in authorship attribution and identifying discriminating words that have low variance across documents by the same author.

Contribution

It presents a simple, tuning-free method that enhances authorship attribution by effectively measuring document similarity and highlighting characteristic words.

Findings

01. Achieves state-of-the-art accuracy in authorship attribution challenges.

02. Identifies low-variance, author-specific discriminating words.

03. HC-based measure is robust to topic variations.

Abstract

We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning, reporting accuracy at the state-of-the-art level in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that when comparing a new document with a corpus by a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
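The abstract does not spell out the computation, so the following is only a minimal sketch of how an HC-style comparison of two word-frequency tables might look, not the paper's implementation. It assumes one exact two-sided binomial p-value per word (testing whether the word's occurrences split between the two tables in proportion to the table sizes) and the standard Donoho-Jin form of the statistic, HC = max over i ≤ γ₀N of √N (i/N − p₍ᵢ₎) / √(p₍ᵢ₎(1 − p₍ᵢ₎)), where p₍ᵢ₎ are the sorted p-values. The function names, the choice of per-word test, and the default `gamma0=0.4` are illustrative assumptions.

```python
import math
from collections import Counter


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def binom_two_sided_pvalue(x, n, p):
    """Exact two-sided p-value: total mass of outcomes no more likely than x."""
    if n == 0:
        return 1.0
    log_obs = binom_logpmf(x, n, p)
    total = 0.0
    for k in range(n + 1):
        lp = binom_logpmf(k, n, p)
        if lp <= log_obs + 1e-9:  # small tolerance for floating-point ties
            total += math.exp(lp)
    return min(1.0, total)


def word_pvalues(counts1, counts2):
    """One p-value per word (assumed test: binomial split by table size)."""
    t1, t2 = sum(counts1.values()), sum(counts2.values())
    if t1 == 0 or t2 == 0:
        raise ValueError("both tables must contain at least one word")
    p = t1 / (t1 + t2)  # expected share of occurrences falling in table 1
    pvals = {}
    for word in set(counts1) | set(counts2):
        n1, n2 = counts1.get(word, 0), counts2.get(word, 0)
        pvals[word] = binom_two_sided_pvalue(n1, n1 + n2, p)
    return pvals


def higher_criticism(pvals, gamma0=0.4):
    """Return (HC score, words whose p-value is at or below the HC threshold)."""
    if not pvals:
        raise ValueError("need at least one p-value")
    items = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(items)
    limit = max(1, int(gamma0 * n))  # only the smallest gamma0 fraction is scanned
    best_score, best_idx = -math.inf, 0
    for i, (_, pv) in enumerate(items[:limit], start=1):
        pv = min(max(pv, 1e-12), 1.0 - 1e-12)  # keep the denominator finite
        score = math.sqrt(n) * (i / n - pv) / math.sqrt(pv * (1.0 - pv))
        if score > best_score:
            best_score, best_idx = score, i
    threshold = items[best_idx - 1][1]
    selected = [word for word, pv in items if pv <= threshold]
    return best_score, selected


# Toy tables (made-up counts): function words agree, "upon"/"whilst" do not.
known = Counter({"the": 60, "of": 40, "upon": 25, "and": 30})
query = Counter({"the": 58, "of": 42, "whilst": 25, "and": 32})
score, words = higher_criticism(word_pvalues(known, query))
same_score, _ = higher_criticism(word_pvalues(known, known))
print(round(score, 1), sorted(words), same_score < score)
```

In this sketch a larger score means the two tables are further apart, and the words at or below the maximizing p-value play the role of the discriminating subset that the abstract describes as a side effect of the HC calculation. Published HC variants differ in normalization and in how very small p-values are handled, so the exact form here should be checked against the paper before reuse.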

Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Computational and Text Analysis Methods

Methods: Test