Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

Arthur Amalvy; Vincent Labatut; Xavier Bost; Hen-Hsen Huang

arXiv:2604.23412·cs.CL·April 28, 2026

TL;DR

This paper introduces a method for sharing the annotations of copyrighted texts alongside a non-reversible hash of the source tokens, enabling lawful distribution while remaining robust to differences between editions of the source material.

Contribution

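The authors propose a hashing-based approach that lets corpus creators lawfully share annotations of copyrighted texts; users who own the source material hash their own tokens and align them to the shared annotations with high token-level accuracy.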

Findings

01. Achieves 98.7% to 99.79% token alignment accuracy

02. Method is robust to reasonable divergences in source data versions

03. Publicly releases a Python implementation called novelshare

Abstract

While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in the clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material and apply the same hash function to their own tokens in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show…
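The creator/user workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the novelshare API: the function names (`hash_token`, `share`, `align`), the choice of SHA-256, the use of `difflib.SequenceMatcher` for alignment, and the toy tokens and labels are all assumptions made for this example. The abstract does not specify the paper's actual hash function or alignment procedure. Note also that a plain per-token digest of common words could in principle be guessed by hashing a dictionary, so this sketch should not be read as meeting the paper's non-reversibility requirement.

```python
import hashlib
from difflib import SequenceMatcher


def hash_token(token: str) -> str:
    """One-way digest of a single token (SHA-256 is an assumption here)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def share(tokens, labels):
    """Creator side: labels stay in the clear, tokens are replaced by hashes."""
    return [hash_token(t) for t in tokens], list(labels)


def align(shared_hashes, shared_labels, user_tokens, default="O"):
    """User side: hash own tokens, then copy labels across matching runs."""
    user_hashes = [hash_token(t) for t in user_tokens]
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    user_labels = [default] * len(user_tokens)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            user_labels[block.b + k] = shared_labels[block.a + k]
    return user_labels


# Hypothetical creator edition with toy entity labels.
creator_tokens = ["Mr", "Darcy", "soon", "drew", "the", "attention", "of", "the", "room"]
creator_labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O"]
hashes, labels = share(creator_tokens, creator_labels)

# Hypothetical user edition that diverges by one extra word ("very").
user_tokens = ["Mr", "Darcy", "very", "soon", "drew", "the", "attention", "of", "the", "room"]
aligned = align(hashes, labels, user_tokens)
print(list(zip(user_tokens, aligned)))
```

In this sketch, tokens that have no counterpart in the shared hash sequence (here, the extra word) simply receive the default label, which is one simple way to tolerate small differences between editions.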
