Scaling Training Data with Lossy Image Compression

Katherine L. Mentzer; Andrea Montanari

arXiv:2407.17954·cs.CV·July 26, 2024

TL;DR

This paper introduces a storage scaling law for lossy image compression in training data, enabling optimization of compression levels to improve model performance under storage constraints.

Contribution

The paper proposes and empirically validates a scaling law that relates test error to sample size and bits per image, so that the compression level of training data can be chosen to suit a machine learning task.

Findings

1. The law accurately predicts test error across compression levels.

2. Optimally compressed images lead to lower test error at fixed storage.

3. Randomizing compression levels offers potential benefits.

Abstract

Empirically-determined scaling laws have been broadly successful in predicting the evolution of large machine learning models with training data and number of parameters. As a consequence, they have been useful for optimizing the allocation of limited resources, most notably compute time. In certain applications, storage space is an important constraint, and data format needs to be chosen carefully as a consequence. Computer vision is a prominent example: images are inherently analog, but are always stored in a digital format using a finite number of bits. Given a dataset of digital images, the number of bits $L$ to store each of them can be further reduced using lossy data compression. This, however, can degrade the quality of the model trained on such images, since each example has lower resolution. In order to capture this trade-off and optimize storage of training data, we…

Code & Models

Repositories

granica-ai/lossycompressionscalingkdd2024
Official