Inference-Sufficient Representations for High-Throughput Measurement: Lessons from Lossless Compression Benchmarks in 4D-STEM

Ondrej Dyck; Andrew R. Lupini; Albina Borisevich; Miaofang Chi; Rama K. Vasudevan; Stephen Jesse

arXiv:2604.06221·eess.SP·April 9, 2026

TL;DR

This study benchmarks lossless compression methods for 4D-STEM datasets. It shows that well-chosen codecs can shrink data more than tenfold and shorten transfer times, and it highlights the need for inference-driven data representations in sustainable high-throughput workflows.

Contribution

The paper systematically compares lossless compression techniques for 4D-STEM data, providing practical guidance and emphasizing the importance of inference-sufficient representations over raw data storage.

Findings

01

blosc_zstd achieves compression comparable to gzip-9 (mean 13.5× vs 12.3×) while compressing 19–69× faster.

02

Compression ratios are highly reproducible and follow a power law with data sparsity.

03

4D-STEM data can be routinely compressed by over 10 times.

Abstract

Four-dimensional scanning transmission electron microscopy (4D-STEM) generates multi-gigabyte datasets, creating a growing mismatch between acquisition rates and practical storage, transfer, and interactive visualization capabilities. We systematically benchmark 13 lossless compression implementations across 5 representative datasets (8 MiB to 8 GiB, 49.5–92.8% sparsity), with 10 independent runs per method. HDF5 provides built-in gzip compression, of which gzip-9 typically achieves the highest compression ratio but is slow. We therefore evaluate widely available alternatives (via hdf5plugin), including the Blosc family. As a representative comparison, blosc_zstd achieves compression comparable to gzip-9 (mean 13.5× vs 12.3×) while compressing 19–69× faster and reading 1.9–2.6× faster across datasets. Compression ratios are deterministic, and timing…
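
The benchmark idea described in the abstract (compress a sparse dataset losslessly, record the compression ratio and the time taken, and repeat across sparsity levels) can be illustrated with a minimal sketch. This is not the paper's benchmark code. It assumes synthetic 16-bit frames with random small counts in place of real 4D-STEM data, and it uses Python's standard-library zlib (the DEFLATE algorithm behind HDF5's gzip filter) rather than HDF5 or hdf5plugin. The names make_sparse_frame, benchmark, and run are hypothetical.

```python
import random
import time
import zlib


def make_sparse_frame(n_pixels, sparsity, seed=0):
    """Synthetic stand-in for detector data (an assumption, not real 4D-STEM):
    n_pixels little-endian 16-bit counts, roughly `sparsity` of them zero."""
    rng = random.Random(seed)
    out = bytearray()
    for _ in range(n_pixels):
        value = 0 if rng.random() < sparsity else rng.randint(1, 20)
        out += value.to_bytes(2, "little")
    return bytes(out)


def benchmark(data, level):
    """Compress with zlib at the given level; return (ratio, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Lossless check: the round trip must reproduce the input exactly.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed), elapsed


def run(sparsities=(0.5, 0.7, 0.9), n_pixels=128 * 128, levels=(1, 9)):
    """Benchmark every (sparsity, level) pair on fixed-seed synthetic data."""
    results = {}
    for sparsity in sparsities:
        data = make_sparse_frame(n_pixels, sparsity)
        for level in levels:
            results[(sparsity, level)] = benchmark(data, level)
    return results


if __name__ == "__main__":
    for (sparsity, level), (ratio, seconds) in run().items():
        print(f"sparsity={sparsity:.1f} level={level} "
              f"ratio={ratio:.2f} time={seconds * 1e3:.2f} ms")
```

Because the synthetic data use a fixed seed, the ratio for a given sparsity and level is identical from run to run, while the timing varies with the machine. That mirrors the abstract's distinction between deterministic compression ratios and measured timings. The absolute numbers from this sketch say nothing about the paper's datasets or codecs.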
