HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction

Francis Kulumba; Wissam Antoun; Guillaume Vimont; Laurent Romary; Florian Cafiero

arXiv:2407.20595·cs.DL·May 20, 2026

TL;DR

This paper introduces HALvest-Contrastive, an English contrastive dataset for authorship attribution derived from the HALvest corpus, and shows that comparing documents as sequences of vectors with late interaction, including a patch-level variant, outperforms comparing single vectors.

Contribution

It proposes a new contrastive dataset and a patch-level late interaction technique for more accurate authorship attribution.

Findings

01. Sequence-level comparison outperforms single-vector methods.

02. Patch-Level Late Interaction (PLI) enhances matching accuracy.

03. Optimal interaction granularity is subtle and crucial for performance.

Abstract

Deciding whether two pieces of text share an author is made difficult by topical confound: two writers covering the same topic often look more alike than one writer covering two topics. We tackle this with HALvest, a 17-billion-token multilingual corpus of open-access scholarly papers, and its English contrastive derivative HALvest-Contrastive, in which same-author passages are drawn from distinct papers within a field to minimize topical overlap. We also revisit how documents are compared. Authorship systems traditionally compress each document into a single vector; we instead keep a sequence of vectors and compare them with late interaction, then introduce Patch-Level Late Interaction (PLI), which compresses neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is…
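To make the comparison scheme in the abstract concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. It assumes that late interaction means ColBERT-style MaxSim scoring (for each vector on one side, take the best cosine match on the other side, then sum) and that patches are formed by mean-pooling non-overlapping windows of neighboring token vectors. The abstract specifies neither choice, and the names `late_interaction_score`, `pool_patches`, `patch_level_score`, and `patch_size` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query: List[Vector], doc: List[Vector]) -> float:
    """ASSUMPTION: MaxSim scoring. For every query vector, keep its highest
    cosine similarity against any document vector, then sum those maxima."""
    q = [normalize(v) for v in query]
    d = [normalize(v) for v in doc]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def pool_patches(tokens: List[Vector], patch_size: int) -> List[Vector]:
    """ASSUMPTION: a patch is the mean of `patch_size` consecutive token
    vectors; the last patch may be shorter if the length is not a multiple."""
    if patch_size < 1:
        raise ValueError("patch_size must be at least 1")
    patches = []
    for start in range(0, len(tokens), patch_size):
        window = tokens[start:start + patch_size]
        dim = len(window[0])
        patches.append([sum(v[i] for v in window) / len(window) for i in range(dim)])
    return patches


def patch_level_score(query: List[Vector], doc: List[Vector], patch_size: int) -> float:
    """Compress both sequences into patches, then apply late interaction."""
    return late_interaction_score(
        pool_patches(query, patch_size), pool_patches(doc, patch_size)
    )


# Toy 2-dimensional "token embeddings" for two passages.
passage_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
passage_b = [[2.0, 0.0], [0.0, 3.0]]
token_level = late_interaction_score(passage_a, passage_b)
patch_level = patch_level_score(passage_a, passage_b, patch_size=2)
```

In this sketch, `patch_size=1` reduces to token-level late interaction, and larger values trade matching detail for fewer vectors per document, which is the granularity knob the abstract refers to. Note that the MaxSim score as written is asymmetric: swapping `query` and `doc` can change the result.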

Code & Models

Repositories

madjakul/HALvesting (GitHub)

Taxonomy

Methods: Contrastive Learning