ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models
Mina Namazi, Alexander Nemecek, Erman Ayday

TL;DR
ZKPROV introduces a cryptographic framework that enables verifiable, privacy-preserving proof of dataset provenance for large language models, balancing efficiency and security in sensitive applications.
Contribution
It presents a novel zero-knowledge proof system for dataset provenance verification in LLMs, ensuring confidentiality and efficiency.
Findings
Sublinear proof generation and verification times
End-to-end overhead under 3.3 seconds for 8B-parameter models
Formal security guarantees for dataset confidentiality
Abstract
As large language models (LLMs) are used in sensitive fields, accurately verifying their computational provenance without disclosing their training datasets poses a significant challenge, particularly in regulated sectors such as healthcare, which have strict requirements for dataset use. Traditional approaches either incur substantial computational cost to fully verify the entire training process or leak unauthorized information to the verifier. Therefore, we introduce ZKPROV, a novel cryptographic framework that lets users verify that the LLM answering their prompts was trained on datasets certified by the authorities that own them. Additionally, it ensures that the dataset's content is relevant to the users' queries without revealing sensitive information about the datasets or the model parameters. ZKPROV offers a unique balance between privacy and efficiency by binding…
Taxonomy
Topics: Scientific Computing and Data Management · Topic Modeling · Data Quality and Management
