Entropy Coding of Unordered Data Structures

Julius Kunze; Daniel Severo; Giulio Zani; Jan-Willem van de Meent; James Townsend

arXiv:2408.08837·cs.LG·August 19, 2024

TL;DR

This paper introduces shuffle coding, a versatile entropy coding method for compressing unordered data structures like multisets and graphs, achieving state-of-the-art results on graph datasets.

Contribution

The paper proposes shuffle coding, a novel general approach for optimal compression of unordered data structures using bits-back coding, with an adaptable implementation.

Findings

1. Achieves state-of-the-art compression rates on graph datasets
2. Applicable to various data structures including multisets and hypergraphs
3. Provides an adaptable implementation for different data types

Abstract

We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.
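
As a rough illustration of why discarding order saves bits: for a multiset of n elements, the number of distinct orderings is n! divided by the product of the factorials of the element multiplicities. The base-2 logarithm of that count is the number of bits an order-aware code spends on information the multiset does not contain. The Python sketch below computes only that quantity. The function name `ordering_bits` is hypothetical, and the code is not taken from the authors' implementation, which also performs the actual bits-back encoding and decoding.

```python
from collections import Counter
from math import factorial, log2


def ordering_bits(items):
    """Return log2 of the number of distinct orderings of a multiset.

    `items` is any iterable of hashable elements. Repeated elements are
    interchangeable, so they reduce the number of distinct orderings.
    """
    items = list(items)
    counts = Counter(items)
    # log2(n!) minus log2(m_i!) for each multiplicity m_i
    total = log2(factorial(len(items)))
    repeats = sum(log2(factorial(c)) for c in counts.values())
    return total - repeats
```

For three distinct elements this gives log2(3!) = log2(6), roughly 2.58 bits; for three identical elements it gives 0, since there is only one ordering to begin with.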

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 2

Strengths

Removing the order information from a data structure seems a meaningful way to reduce compression cost. The proposed shuffle coding obtains a discount in the lossless compression of such data structures, as illustrated by Equation 14.

Weaknesses

My major concern is the significance of the problem studied in this paper: considering the complexity, will the proposed method have wide or potential applications in practice? To me, it seems somewhat intuitive to remove the order information in order to reduce the coding cost when compressing graph data. Is bits-back coding necessary in this scheme? These concerns may partially be attributed to my lack of expertise in the field of graph compression. In addition, Appendix C d

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

This paper presents a few key strengths which, in my view, are as follows. Elegant unified framework: This paper provides a unified theoretical framework for compressing unordered objects, such as multisets and graphs. The approach is based on the elegant idea that, because the order of the parts of an object does not matter, one can reduce the cost of communicating the object by getting a certain number of bits back, namely the bits corresponding to a particular ordering of the parts. This general

Weaknesses

The paper's main weaknesses, in my view, revolve around the practical applicability of shuffle coding. Large runtime complexity: As the authors note, applying shuffle coding to a graph requires solving a graph isomorphism problem, for which no polynomial-time algorithm is known. This can be a significant hurdle when coding larger graphs. The authors brought up this issue in the paper, and suggested that approximately solving the isomorphism problem is a promising way to scale the method. Ho

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The unordered set / graph compression problem is of good practical value. The proposed approach is a neat extension of bits-back coding. It is simple, novel and works well.

Weaknesses

As the authors have discussed, the number of initial bits currently required is quite large. This hinders the practical application of the proposed approach to one-shot object coding, though it is still possible to apply the approach to a dataset to amortize the initial bits. An alternative to the bit-swap approach mentioned by the authors is correlation communication [Harsha 2010, The Communication Complexity of Correlation] [Li 2018, Strong Functional Representation Lemma and Applications to Co

Code & Models

Repositories

juliuskunze/shuffle-coding (Official)

Taxonomy

Topics: Computability, Logic, AI Algorithms · Cellular Automata and Applications