Pareto-optimal data compression for binary classification tasks

Max Tegmark (MIT); Tailin Wu (MIT)

arXiv:1908.08961·cs.LG·January 16, 2020

TL;DR

This paper introduces a method for mapping data into a compressed representation that optimally trades off the information retained about a class label against the entropy of the representation. It works out the binary classification case in detail by mapping out the Pareto frontier of this tradeoff.

Contribution

It presents a novel approach to visualize and compute the Pareto frontier for data compression in classification tasks, including a lossless reduction to a real-valued variable and a binning strategy for binary cases.
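As a minimal sketch of what "lossless reduction" means here, the toy example below checks numerically that merging inputs with the same class posterior leaves the mutual information with the label unchanged. The joint table `p_xy` is hypothetical, and taking the real-valued variable to be $W = P(Y = 1 \mid X)$ is an assumption made for illustration; this page does not spell out the mapping.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(A;B) in bits from a joint probability table joint[a, b]."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal over b, shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal over a, shape (1, B)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

# Hypothetical toy joint distribution p(x, y): 4 values of X, 2 classes of Y.
p_xy = np.array([[16, 48],
                 [32,  8],
                 [ 1,  3],
                 [16,  4]]) / 128.0

# Assumed distillation: W = P(Y = 1 | X = x), one real number per x.
w = p_xy[:, 1] / p_xy.sum(axis=1)

# Inputs sharing the same w collapse to a single value of W.
values, idx = np.unique(np.round(w, 12), return_inverse=True)
p_wy = np.zeros((len(values), p_xy.shape[1]))
np.add.at(p_wy, idx, p_xy)                  # accumulate p(x, y) rows into p(w, y)

i_xy = mutual_information(p_xy)
i_wy = mutual_information(p_wy)
print(f"distinct W values: {len(values)}")
print(f"I(X;Y) = {i_xy:.6f} bits, I(W;Y) = {i_wy:.6f} bits")
```

In this toy table the four inputs collapse to two distinct values of $W$, so the representation is smaller while the class information is the same.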

Findings

01. Efficiently maps data to a real-valued variable preserving all class information.

02. Provides a method to sweep the Pareto frontier by binning the real-valued variable.

03. Demonstrates the approach on CIFAR-10, MNIST, and Fashion-MNIST datasets.
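The binning idea in the findings above can be sketched on synthetic data. The example below is illustrative only: it assumes the real-valued variable `w` is a calibrated class probability, draws it uniformly at random, and uses equal-width bins. The paper's method selects bin boundaries to reach the frontier, so the points computed here are plausible tradeoff points, not necessarily Pareto-optimal ones.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def binned_point(w, y, edges):
    """Return (H(Z), I(Z;Y)) in bits, where Z is the index of the bin containing w."""
    z = np.digitize(w, edges)                # Z takes values 0 .. len(edges)
    joint = np.zeros((len(edges) + 1, 2))
    np.add.at(joint, (z, y), 1.0)            # empirical counts of (z, y) pairs
    joint /= joint.sum()
    h_z = entropy(joint.sum(axis=1))
    h_y = entropy(joint.sum(axis=0))
    h_zy = entropy(joint.ravel())
    return h_z, h_z + h_y - h_zy             # I(Z;Y) = H(Z) + H(Y) - H(Z,Y)

# Synthetic stand-in data (assumption): w uniform on [0, 1], y ~ Bernoulli(w).
rng = np.random.default_rng(0)
w = rng.uniform(size=20000)
y = (rng.uniform(size=w.size) < w).astype(int)

points = []
for m in range(2, 9):                        # number of bins
    edges = np.linspace(0.0, 1.0, m + 1)[1:-1]   # interior equal-width edges
    h_z, i_zy = binned_point(w, y, edges)
    points.append((m, h_z, i_zy))
    print(f"m={m}: H(Z)={h_z:.3f} bits, I(Z;Y)={i_zy:.3f} bits")
```

Each `(H(Z), I(Z;Y))` pair is one candidate point in the entropy versus class-information plane; a single bin gives the trivial point `(0, 0)`.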

Abstract

The goal of lossy data compression is to reduce the storage cost of a data set $X$ while retaining as much information as possible about something ($Y$) that you care about. For example, what aspects of an image $X$ contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping $X \to Z \equiv f(X)$ that maximizes the mutual information $I(Z, Y)$ while the entropy $H(Z)$ is kept below some fixed threshold. We present a method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable $X$ (an image, say) drawn from a class $Y \in \{1, \ldots, n\}$ can be distilled into a vector $W = f(X) \in \mathbb{R}^{n-1}$ losslessly, so that $I(W, Y) = I(X, Y)$; for example, for a binary classification task of cats and dogs, each image $X$ is mapped into a…
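The optimization described in the abstract can be written compactly as follows. The threshold symbol $H_*$ is a label introduced here for illustration and does not appear in the text above.

$$
\max_{f} \; I(Z, Y) \quad \text{subject to} \quad H(Z) \le H_*, \qquad Z \equiv f(X).
$$

Sweeping $H_*$ over its range traces out the Pareto frontier in the $(H(Z), I(Z, Y))$ plane. Because $Z$ is a function of $X$, the data processing inequality gives $I(Z, Y) \le I(X, Y)$, and the lossless distillation $W$ attains this bound: $I(W, Y) = I(X, Y)$.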

Code & Models

Repositories

tailintalent/distillation (Official)

Taxonomy

Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Evolutionary Algorithms and Applications