Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
Vladimir Iglovikov, Dmitry Kosarevsky

TL;DR
This paper tests how well single-thread JPEG decoder benchmarks predict ML data loader performance across five CPU architectures, and finds that decoder rankings from single-thread runs can differ from rankings under a multi-worker PyTorch DataLoader.
Contribution
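It introduces a benchmarking protocol that measures thirteen Python-accessible JPEG decode paths under both single-thread and PyTorch DataLoader conditions on five matched 16 vCPU cloud CPUs, giving a more accurate assessment of JPEG decoder performance in ML workloads than single-thread evaluation alone.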
Findings
Decoder rankings vary significantly across CPU architectures.
Worker count impacts performance conclusions differently on Zen 4 and Zen 5.
TensorFlow exhibits a large single-thread penalty on ARM.
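To make the first finding concrete, here is a minimal sketch of how a ranking shift between two evaluation protocols could be computed. The function names, decoder names, and throughput numbers are all placeholders invented for illustration; none of them come from the paper.

```python
def rank_by_throughput(throughput):
    """Map each decoder name to its 1-based rank (1 = fastest)."""
    ordered = sorted(throughput, key=throughput.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_shifts(single_thread, data_loader):
    """Return {decoder: single-thread rank minus DataLoader rank}.

    A positive value means the decoder ranks better under the DataLoader
    protocol. Only decoders measured under both protocols are compared.
    """
    shared = sorted(set(single_thread) & set(data_loader))
    st = rank_by_throughput({name: single_thread[name] for name in shared})
    dl = rank_by_throughput({name: data_loader[name] for name in shared})
    return {name: st[name] - dl[name] for name in shared}


# Placeholder throughputs in images/second, invented for illustration only.
single_thread = {"decoder_a": 900.0, "decoder_b": 700.0, "decoder_c": 500.0}
data_loader = {"decoder_a": 4000.0, "decoder_b": 3000.0, "decoder_c": 5000.0}
shifts = rank_shifts(single_thread, data_loader)
```

In this made-up case `decoder_c` moves from last place to first once the protocol changes, which is the kind of reordering the findings describe.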
Abstract
JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single-process, single-thread microbenchmarks. We audit this evaluation assumption with thirteen Python-accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs: Intel Emerald Rapids, AMD Zen 4, AMD Zen 5, ARM Neoverse V2, and ARM Neoverse N1. ImageNet validation is the workload, not a new dataset contribution: each run decodes the full 50,000-image split from memory and reports single-thread throughput for all decoders, PyTorch DataLoader throughput for eligible decoders at several worker counts, and decoder skip behavior. The evaluation protocol changes the supported conclusion. On Neoverse V2, imageio is ninth in single-thread throughput yet lands in the top DataLoader tier with torchvision; on Zen 4, torchvision rises from seventh…
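The single-thread part of such a protocol can be sketched as a small timing loop over in-memory byte strings. This is an illustrative harness, not the authors' code: `single_thread_throughput` and `toy_decode` are hypothetical names, the toy decoder only checks the JPEG start-of-image marker instead of decoding pixels, and counting any raised exception as a skip is an assumption about what "skip behavior" means here. The multi-worker DataLoader measurement is not covered.

```python
import time


def single_thread_throughput(decode, encoded_images):
    """Decode every in-memory image once; return (images_per_second, skipped)."""
    decoded = 0
    skipped = 0
    start = time.perf_counter()
    for data in encoded_images:
        try:
            decode(data)
            decoded += 1
        except Exception:
            # Assumption: an image the decoder rejects counts as a skip.
            skipped += 1
    elapsed = time.perf_counter() - start
    rate = decoded / elapsed if elapsed > 0 else float("inf")
    return rate, skipped


def toy_decode(data):
    # Stand-in for a real JPEG decoder: checks only the SOI marker (FF D8).
    if not data.startswith(b"\xff\xd8"):
        raise ValueError("not a JPEG stream")
    return len(data)


# 1000 fake "JPEG" payloads plus one payload the toy decoder rejects.
images = [b"\xff\xd8" + bytes(64)] * 1000 + [b"not-a-jpeg"]
rate, skipped = single_thread_throughput(toy_decode, images)
```

Swapping `toy_decode` for a real decoder callable that accepts bytes would give a single-thread number comparable across decoders, provided every decoder sees the same in-memory inputs.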
