Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge

TL;DR
This paper introduces Knowledge Units, an LLM-generated representation of scientific texts that preserves factual content while dropping stylistic expression. The authors argue that this representation is legally defensible and enables broader access to and reuse of scientific knowledge.
Contribution
It proposes a new method to convert scholarly documents into style-agnostic, knowledge-preserving units, supported by legal analysis and empirical evidence of factual retention.
Findings
The authors argue that Knowledge Units are legally defensible, based on analyses of German copyright law and the U.S. Fair Use doctrine.
They preserve approximately 95% of factual knowledge.
Open-source tools for conversion are provided.
Abstract
Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We propose a new idea for the community to adopt: using LLMs, convert scholarly documents into knowledge-preserving but style-agnostic representations that we term Knowledge Units. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%)…
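To make the idea of "structured data capturing entities, attributes and relationships" more concrete, here is a minimal Python sketch of what such a record could look like. The class names, field names and JSON layout are illustrative assumptions, not the schema used by the authors, and the LLM extraction step itself is not shown.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Relationship:
    # Hypothetical triple layout: subject --predicate--> obj.
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeUnit:
    # Hypothetical container; the paper's actual schema may differ.
    entities: List[str] = field(default_factory=list)
    attributes: Dict[str, Dict[str, str]] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Toy example built from statements in the abstract itself.
unit = KnowledgeUnit(
    entities=["Knowledge Unit", "scholarly text"],
    attributes={"Knowledge Unit": {"style": "agnostic"}},
    relationships=[
        Relationship("Knowledge Unit", "derived_from", "scholarly text"),
    ],
)

serialized = unit.to_json()
print(serialized)
```

The point of the sketch is that the record holds only facts and their links, so the original wording of a sentence does not survive serialization.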
Taxonomy
Topics: Library Science and Information Systems · Wikis in Education and Collaboration · Intellectual Property and Patents
