ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models

Mina Namazi; Alexander Nemecek; Erman Ayday

arXiv:2506.20915 · cs.CR · December 22, 2025

Open Access

TL;DR

ZKPROV introduces a cryptographic framework that enables verifiable, privacy-preserving proof of dataset provenance for large language models, balancing efficiency and security in sensitive applications.

Contribution

The paper presents a novel zero-knowledge proof system that verifies dataset provenance for LLMs while keeping the datasets confidential and proof generation efficient.

Findings

01. Sublinear proof generation and verification times

02. End-to-end overhead under 3.3 seconds for 8B-parameter models

03. Formal security guarantees for dataset confidentiality

Abstract

As large language models (LLMs) are deployed in sensitive fields, verifying their computational provenance without disclosing their training datasets is a significant challenge, particularly in regulated sectors such as healthcare, which impose strict requirements on dataset use. Traditional approaches either incur substantial computational cost to verify the entire training process or leak information that the verifier is not authorized to see. We therefore introduce ZKPROV, a novel cryptographic framework that lets users verify that the LLM answering their prompts was trained on datasets certified by the authorities that own them. It also ensures that the datasets' content is relevant to the users' queries without revealing sensitive information about the datasets or the model parameters. ZKPROV offers a unique balance between privacy and efficiency by binding…
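
The abstract is cut off above, so this excerpt does not say what ZKPROV binds or how. As a loose illustration of the general idea of certifying a dataset without publishing it, the following Python sketch has an authority tag a hiding commitment to the dataset rather than the dataset itself. Everything in it is an assumption made for illustration: the function names are hypothetical, the HMAC tag is a symmetric stand-in for a real public-key signature, and no zero-knowledge proof is involved. It is not the construction used in the paper.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 32  # fixed length, so nonce || dataset parses unambiguously


def commit(dataset_bytes: bytes, nonce: bytes) -> bytes:
    """Hash-based commitment: SHA-256 over nonce || dataset (illustrative)."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("nonce must be exactly NONCE_LEN bytes")
    return hashlib.sha256(nonce + dataset_bytes).digest()


def certify(authority_key: bytes, commitment: bytes) -> bytes:
    """Stand-in for the authority's signature: an HMAC tag over the commitment."""
    return hmac.new(authority_key, commitment, hashlib.sha256).digest()


def check_certificate(authority_key: bytes, commitment: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(certify(authority_key, commitment), tag)


def open_commitment(commitment: bytes, dataset_bytes: bytes, nonce: bytes) -> bool:
    """Check that (dataset, nonce) is a valid opening of the commitment."""
    return hmac.compare_digest(commit(dataset_bytes, nonce), commitment)


if __name__ == "__main__":
    dataset = b"record-1\nrecord-2\n"          # placeholder data, not from the paper
    nonce = secrets.token_bytes(NONCE_LEN)     # kept secret by the dataset owner
    authority_key = secrets.token_bytes(32)    # hypothetical authority key

    c = commit(dataset, nonce)
    tag = certify(authority_key, c)

    # The certificate covers only the commitment, never the raw dataset.
    print(check_certificate(authority_key, c, tag))
    print(open_commitment(c, dataset, nonce))
```

In this toy version the only way to convince anyone that the commitment matches the dataset is to reveal the dataset and nonce. A zero-knowledge proof system is what would let a prover establish statements about the committed data without that opening.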

Taxonomy

Topics: Scientific Computing and Data Management · Topic Modeling · Data Quality and Management
