HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction
Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary, Florian Cafiero

TL;DR
This paper introduces HALvest-Contrastive, an English contrastive dataset for authorship attribution, and Patch-Level Late Interaction (PLI), a matching scheme that compares documents as sequences of vectors rather than as single vectors.
Contribution
It proposes a new contrastive dataset and a patch-level late interaction technique for more accurate authorship attribution.
Findings
Sequence-level comparison outperforms single-vector methods.
Patch-Level Late Interaction (PLI) enhances matching accuracy.
Optimal interaction granularity is subtle and crucial for performance.
Abstract
Deciding whether two pieces of text share an author is made difficult by topical confound: two writers covering the same topic often look more alike than one writer covering two topics. We tackle this with HALvest, a 17-billion-token multilingual corpus of open-access scholarly papers, and its English contrastive derivative HALvest-Contrastive, in which same-author passages are drawn from distinct papers within a field to minimize topical overlap. We also revisit how documents are compared. Authorship systems traditionally compress each document into a single vector; we instead keep a sequence of vectors and compare them with late interaction, then introduce Patch-Level Late Interaction (PLI), which compresses neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is…
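The difference between single-vector comparison, late interaction, and patch-level matching can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how patches are formed or how scores are aggregated, so the sketch assumes ColBERT-style MaxSim scoring (sum over one side of the best cosine match on the other) and mean pooling over fixed-size, non-overlapping windows of token vectors. All function names and the patch_size parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of vectors dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def single_vector_score(a: List[Vector], b: List[Vector]) -> float:
    """Baseline: collapse each document to one vector, then cosine similarity."""
    return dot(normalize(mean_pool(a)), normalize(mean_pool(b)))


def late_interaction_score(a: List[Vector], b: List[Vector]) -> float:
    """Assumed MaxSim scoring: each vector in `a` picks its best match in `b`."""
    a_n = [normalize(v) for v in a]
    b_n = [normalize(v) for v in b]
    return sum(max(dot(u, v) for v in b_n) for u in a_n)


def to_patches(vectors: List[Vector], patch_size: int) -> List[Vector]:
    """Assumed patching: mean-pool consecutive, non-overlapping windows."""
    if patch_size < 1:
        raise ValueError("patch_size must be at least 1")
    return [
        mean_pool(vectors[i:i + patch_size])
        for i in range(0, len(vectors), patch_size)
    ]


def pli_score(a: List[Vector], b: List[Vector], patch_size: int) -> float:
    """Patch-level late interaction: compress to patches, then MaxSim."""
    return late_interaction_score(
        to_patches(a, patch_size), to_patches(b, patch_size)
    )


# Toy token embeddings for two passages (made-up values, 2 dimensions).
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]

token_level = late_interaction_score(doc_a, doc_b)
patch_level = pli_score(doc_a, doc_b, patch_size=2)
pooled = single_vector_score(doc_a, doc_b)
```

In this sketch, patch_size controls the interaction granularity: a value of 1 reduces to token-level late interaction, while a value at least as large as the document collapses each document to a single patch, so the score becomes the cosine similarity of the mean-pooled vectors. Note that MaxSim as written is asymmetric, because it sums over the vectors of the first argument only.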
Taxonomy
Methods: Contrastive Learning
