Progressive Compressed Records: Taking a Byte out of Deep Learning Data
Michael Kuchnik, George Amvrosiadis, Virginia Smith

TL;DR
Progressive Compressed Records (PCRs) are a data format that uses progressive compression to reduce data-fetching overhead, enabling faster deep learning training without increasing dataset size.
Contribution
PCRs combine progressive compression with an efficient storage layout, so a single dataset can be read at multiple fidelities without increasing its total size, which improves training efficiency.
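To make the layout idea concrete, here is a minimal sketch of one way a "multiple fidelities from one file" layout could work. It is a toy illustration, not the authors' on-disk format: the function names `build_layout` and `read_at_fidelity`, the pre-split progressive chunks, and the 4-byte length prefixes are all assumptions made for this example. The idea shown is that chunks are grouped by fidelity level across records, so reading a longer prefix of the file yields every record at a higher fidelity, and no record is stored twice.

```python
from typing import List, Tuple

def build_layout(records: List[List[bytes]]) -> Tuple[bytes, List[int]]:
    """Group progressive chunks by level (hypothetical toy layout).

    records[i][k] is the k-th progressive chunk of record i.
    Returns the serialized blob and the byte offset at which each level ends.
    """
    num_levels = len(records[0])
    blob = bytearray()
    level_ends = []
    for level in range(num_levels):
        for rec in records:
            chunk = rec[level]
            # 4-byte length prefix is framing for this toy only
            blob += len(chunk).to_bytes(4, "big") + chunk
        level_ends.append(len(blob))
    return bytes(blob), level_ends

def read_at_fidelity(blob: bytes, level_ends: List[int],
                     fidelity: int, num_records: int) -> Tuple[List[bytes], int]:
    """Read only the prefix covering levels 1..fidelity and reassemble records."""
    prefix = blob[:level_ends[fidelity - 1]]
    out = [bytearray() for _ in range(num_records)]
    pos, i = 0, 0
    while pos < len(prefix):
        n = int.from_bytes(prefix[pos:pos + 4], "big")
        pos += 4
        # chunks cycle through the records within each level group
        out[i % num_records] += prefix[pos:pos + n]
        pos += n
        i += 1
    return [bytes(o) for o in out], len(prefix)

# Two toy records, each split into three progressive chunks
records = [[b"a1", b"a2", b"a3"], [b"b1", b"b2", b"b3"]]
blob, level_ends = build_layout(records)
low_fidelity, bytes_read = read_at_fidelity(blob, level_ends, 2, len(records))
```

In this sketch a reader that wants lower fidelity simply stops reading earlier, which is where the bandwidth saving would come from. The length prefixes add a small framing overhead that is an artifact of the toy, not a property claimed for PCRs.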
Findings
For many tasks, datasets can tolerate compression exceeding 50% of the original encoding.
Automatic selection of compression levels is feasible and efficient.
PCRs can halve training bandwidth, potentially doubling training speed.
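The link between bandwidth and training speed in the last finding can be sketched with a simple bottleneck model. This is an illustrative assumption, not a result from the paper: the function `images_per_second` and all numbers below are hypothetical. It shows that halving the bytes fetched per image doubles throughput only while data loading is the bottleneck, after which compute caps the gain.

```python
def images_per_second(bandwidth_bytes_per_s: int,
                      bytes_per_image: int,
                      compute_images_per_s: float) -> float:
    """Training throughput under a simple min(I/O rate, compute rate) model."""
    io_rate = bandwidth_bytes_per_s / bytes_per_image
    return min(io_rate, compute_images_per_s)

# Hypothetical numbers: 100 MB/s storage, 100 KB images
BANDWIDTH = 100_000_000
FULL_IMAGE = 100_000
HALF_IMAGE = FULL_IMAGE // 2

# I/O-bound case: the accelerator could process far more than storage delivers
io_bound_full = images_per_second(BANDWIDTH, FULL_IMAGE, 10_000.0)
io_bound_half = images_per_second(BANDWIDTH, HALF_IMAGE, 10_000.0)

# Compute-capped case: the accelerator tops out at 1500 images/s
capped_half = images_per_second(BANDWIDTH, HALF_IMAGE, 1_500.0)
```

Under these assumed numbers, the I/O-bound throughput doubles when image size is halved, while the compute-capped case gains less, which is why the finding is phrased as a potential speedup.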
Abstract
Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserve bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous storage formats by combining progressive compression with an efficient storage layout to view a single dataset at multiple fidelities---all without adding to the total dataset size. We implement PCRs and evaluate them on a range of datasets, training tasks, and hardware architectures. Our work shows that: (i) the amount of compression a dataset can tolerate exceeds 50% of the original encoding…
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
