TL;DR
This paper introduces a method for sharing the annotations of copyrighted texts together with a non-reversible hash of the source tokens, so that a corpus can be distributed lawfully without disclosing the text itself while remaining robust to differences between editions.
Contribution
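The authors propose a hashing-based approach that lets corpus creators lawfully share annotations of copyrighted literary texts, which users who own the source material can realign to their own copy with high token alignment accuracy.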
Findings
Achieves 98.7% to 99.79% token alignment accuracy
Method is robust to reasonable divergences in source data versions
Publicly releases a Python implementation called novelshare
Abstract
While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in the clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show…
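The workflow described in the abstract (hash the tokens, share the hashes with the annotations, then match against the user's own hashed tokens) can be illustrated with a minimal sketch. This is not the novelshare API: the function names, the choice of SHA-256, the "O" default label, and the use of `difflib.SequenceMatcher` for the alignment step are all assumptions made for illustration, and the paper's actual hashing and alignment procedure may differ.

```python
import hashlib
from difflib import SequenceMatcher


def hash_token(token: str) -> str:
    """Hash one token. SHA-256 is an assumed choice, not taken from the paper."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def share(tokens, labels):
    """Creator side: keep the labels in the clear, replace each token by its hash."""
    return [(hash_token(tok), lab) for tok, lab in zip(tokens, labels)]


def align(shared, user_tokens, default="O"):
    """User side: hash the user's own tokens and copy labels over matching runs.

    Tokens with no counterpart in the shared corpus keep the (hypothetical)
    default label.
    """
    shared_hashes = [h for h, _ in shared]
    user_hashes = [hash_token(tok) for tok in user_tokens]
    labels = [default] * len(user_tokens)
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.b + k] = shared[block.a + k][1]
    return list(zip(user_tokens, labels))


# The creator annotates one edition and publishes only hashes and labels.
shared = share(["Frodo", "left", "the", "Shire", "."],
               ["B-PER", "O", "O", "B-LOC", "O"])

# The user owns a slightly different edition with one extra token.
for token, label in align(shared, ["Frodo", "quietly", "left", "the", "Shire", "."]):
    print(token, label)
```

In this toy example the extra token "quietly" has no counterpart in the shared hashes, so it keeps the default label, while the tokens on either side of it recover their annotations. That is the sense in which a sequence alignment over hashes can tolerate small divergences between editions.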
