Inference-Sufficient Representations for High-Throughput Measurement: Lessons from Lossless Compression Benchmarks in 4D-STEM
Ondrej Dyck, Andrew R. Lupini, Albina Borisevich, Miaofang Chi, Rama K. Vasudevan, Stephen Jesse

TL;DR
This study benchmarks lossless compression methods for 4D-STEM datasets. It finds that well-chosen algorithms can substantially reduce data size and transfer times, and it highlights the need for inference-driven data representations to make high-throughput workflows sustainable.
Contribution
The paper systematically compares lossless compression techniques for 4D-STEM data, providing practical guidance and emphasizing the importance of inference-sufficient representations over raw data storage.
Findings
blosc_zstd achieves compression comparable to gzip-9 but is much faster.
Compression ratios are highly reproducible and follow a power law with data sparsity.
4D-STEM data can routinely be compressed by more than a factor of 10.
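The link between sparsity and compression ratio can be illustrated with a small sketch. It is not the paper's pipeline: it assumes synthetic byte data in which a given fraction of values is zero, and it swaps in Python's standard-library zlib (DEFLATE, the algorithm behind gzip) for the HDF5 filters benchmarked in the study. The helper names `sparse_frame` and `compression_ratio` are hypothetical, and the synthetic data need not follow the power law reported for real datasets.

```python
import random
import zlib


def sparse_frame(n_bytes, sparsity, seed=0):
    """Return n_bytes of synthetic data with roughly `sparsity` fraction of zeros.

    Non-zero values are drawn uniformly from 1..255 (an assumption made for
    illustration; real detector counts are distributed differently).
    """
    rng = random.Random(seed)
    return bytes(
        0 if rng.random() < sparsity else rng.randint(1, 255)
        for _ in range(n_bytes)
    )


def compression_ratio(raw, level=9):
    """Uncompressed size divided by DEFLATE-compressed size."""
    return len(raw) / len(zlib.compress(raw, level))


# Sparsity values spanning roughly the range quoted in the abstract.
sparsities = [0.50, 0.70, 0.90]
ratios = [compression_ratio(sparse_frame(200_000, s)) for s in sparsities]

for s, r in zip(sparsities, ratios):
    print(f"sparsity={s:.2f}  ratio={r:.2f}")
```

Because the generator is seeded and DEFLATE is deterministic for a fixed level, repeated runs give identical ratios, which mirrors the reproducibility noted in the findings.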
Abstract
Four-dimensional scanning transmission electron microscopy (4D-STEM) generates multi-gigabyte datasets, creating a growing mismatch between acquisition rates and practical storage, transfer, and interactive visualization capabilities. We systematically benchmark 13 lossless compression implementations across 5 representative datasets (8 MiB to 8 GiB, 49.5–92.8% sparsity), with 10 independent runs per method. HDF5 provides built-in gzip compression, of which gzip-9 typically achieves the highest compression ratio but is slow. We therefore evaluate widely available alternatives (via hdf5plugin), including the Blosc family. As a representative comparison, blosc_zstd achieves compression comparable to gzip-9 (mean 13.5 vs 12.3) while compressing 19–69× faster and reading 1.9–2.6× faster across datasets. Compression ratios are deterministic, and timing…
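The benchmarking protocol described in the abstract (several codecs, repeated runs, ratio plus write and read timings) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses in-memory standard-library codecs (zlib, bz2, lzma) as stand-ins for the HDF5 filters accessed via hdf5plugin, synthetic sparse bytes in place of real 4D-STEM data, and hypothetical names (`CODECS`, `synthetic_data`, `benchmark`).

```python
import bz2
import lzma
import random
import statistics
import time
import zlib

# Stand-in codecs (assumption): name -> (compress function, decompress function).
CODECS = {
    "zlib-1": (lambda b: zlib.compress(b, 1), zlib.decompress),
    "zlib-9": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "bz2-9": (lambda b: bz2.compress(b, 9), bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def synthetic_data(n_bytes=100_000, sparsity=0.9, seed=0):
    """Synthetic sparse bytes; a placeholder for a real diffraction dataset."""
    rng = random.Random(seed)
    return bytes(
        0 if rng.random() < sparsity else rng.randint(1, 255)
        for _ in range(n_bytes)
    )


def benchmark(raw, n_runs=10):
    """Time each codec over n_runs and check that the round trip is lossless."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        compress_times, read_times, sizes = [], [], set()
        for _ in range(n_runs):
            t0 = time.perf_counter()
            packed = compress(raw)
            t1 = time.perf_counter()
            restored = decompress(packed)
            t2 = time.perf_counter()
            if restored != raw:
                raise ValueError(f"{name} round trip was not lossless")
            compress_times.append(t1 - t0)
            read_times.append(t2 - t1)
            sizes.add(len(packed))
        results[name] = {
            "ratio": len(raw) / min(sizes),
            "deterministic": len(sizes) == 1,  # same compressed size every run
            "compress_s": statistics.mean(compress_times),
            "read_s": statistics.mean(read_times),
        }
    return results


results = benchmark(synthetic_data())
for name, row in results.items():
    print(
        f"{name:8s} ratio={row['ratio']:6.2f} "
        f"compress={row['compress_s'] * 1e3:7.2f} ms "
        f"read={row['read_s'] * 1e3:7.2f} ms"
    )
```

Timings from such a sketch depend on the machine and on the in-memory setting, so they are not comparable to the file-based figures in the abstract; only the structure of the measurement carries over.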
