MultiChain Blockchain Data Provenance for Deterministic Stream Processing with Kafka Streams: A Weather Data Case Study
Niaz Mohammad Ramaki, Florian Schintke

TL;DR
This paper presents a blockchain-based provenance architecture for Kafka Streams that ensures auditability and reproducibility of real-time weather data streams by cryptographically anchoring windowed data summaries on-chain.
Contribution
It introduces a method for storing cryptographic hashes of deterministic, windowed weather data streams on the MultiChain blockchain, improving auditability without publishing full payloads on-chain.
Findings
Linear verification cost demonstrated with real weather data
Ensures deterministic reproducibility of streaming analytics
Achieves scalable off-chain storage with on-chain cryptographic anchoring
Abstract
Auditability and reproducibility remain critical challenges for real-time data stream pipelines. Streaming engines depend heavily on runtime scheduling, window triggers, arrival order, and uncertainties such as network jitter. Together, these factors drive streaming platforms to produce non-deterministic outputs. In this work, we introduce a blockchain-backed provenance architecture for streaming platforms (e.g., Kafka Streams) that publishes cryptographic summaries of windowed stream data without publishing the window payloads on-chain. We used real-time weather data from weather stations in Berlin. Weather records are canonicalized, deduplicated, and aggregated per window, then serialised deterministically. The Merkle root of the records within each window is then computed and stored, together with the Kafka offset boundaries, in MultiChain blockchain streams as a checkpoint. Our…
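The checkpointing steps named in the abstract (canonicalize, deduplicate, serialise deterministically, compute a Merkle root, record Kafka offset boundaries) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record fields, the JSON-based canonical form, SHA-256, the leaf/node prefixes, the duplicate-last-node rule for odd levels, and the function names are all assumptions made for illustration. Per-window aggregation and the actual write to a MultiChain stream are omitted.

```python
import hashlib
import json


def canonicalize(record):
    """Serialise a record to bytes deterministically (assumed canonical form:
    JSON with sorted keys and no insignificant whitespace)."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def merkle_root(leaves):
    """Return the hex Merkle root of a list of byte strings.

    Assumptions: SHA-256, a 0x00 prefix for leaves and 0x01 for inner nodes,
    and the last node is duplicated when a level has an odd length.
    """
    if not leaves:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def build_checkpoint(records, topic, partition, first_offset, last_offset):
    """Build the small record that would be anchored on-chain for one window.

    Exact duplicates are dropped and the remaining records are sorted by their
    canonical bytes, so the result does not depend on arrival order.
    """
    unique = sorted({canonicalize(r) for r in records})
    return {
        "topic": topic,
        "partition": partition,
        "first_offset": first_offset,
        "last_offset": last_offset,
        "record_count": len(unique),
        "merkle_root": merkle_root(unique),
    }


def verify_window(records, checkpoint):
    """Recompute the checkpoint from replayed records and compare it with the
    anchored one. The work grows linearly with the number of records."""
    recomputed = build_checkpoint(
        records,
        checkpoint["topic"],
        checkpoint["partition"],
        checkpoint["first_offset"],
        checkpoint["last_offset"],
    )
    return recomputed == checkpoint


if __name__ == "__main__":
    # Hypothetical weather records; field names are illustrative only.
    window = [
        {"station": "s1", "ts": 1, "temp_c": 3.5},
        {"station": "s2", "ts": 1, "temp_c": 4.0},
    ]
    checkpoint = build_checkpoint(window, "weather", 0, 100, 101)
    print(checkpoint)
    print(verify_window(list(reversed(window)), checkpoint))
```

In this sketch an auditor replays the Kafka offsets named in the checkpoint, rebuilds the root from the off-chain records, and compares it with the anchored value; only the small checkpoint record would need to live on-chain.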
Taxonomy
Topics: Scientific Computing and Data Management · Blockchain Technology Applications and Security · Distributed systems and fault tolerance
