Entropy Coding of Unordered Data Structures
Julius Kunze, Daniel Severo, Giulio Zani, Jan-Willem van de Meent, James Townsend

TL;DR
This paper introduces shuffle coding, a versatile entropy coding method for compressing unordered data structures like multisets and graphs, achieving state-of-the-art results on graph datasets.
Contribution
The paper proposes shuffle coding, a novel general approach for optimal compression of unordered data structures using bits-back coding, with an adaptable implementation.
Findings
Achieves state-of-the-art compression rates on graph datasets
Applicable to various data structures including multisets and hypergraphs
Provides an adaptable implementation for different data types
Abstract
We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.
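To make the size of the potential saving concrete, the following is a minimal sketch, not the authors' released implementation. It assumes that the ideal saving from discarding order equals the base-2 logarithm of the number of distinct orderings of an object: n!/∏ mᵢ! for a multiset with element multiplicities mᵢ, and n!/|Aut(G)| for a graph G on n nodes. The function names `multiset_discount_bits` and `graph_discount_bits` are hypothetical, the automorphism count is brute force and only practical for very small graphs, and the code only measures the saving; it does not perform bits-back coding.

```python
import math
from collections import Counter
from itertools import permutations


def multiset_discount_bits(items):
    """Return log2 of the number of distinct orderings of `items`.

    This is log2(n! / prod(m_i!)), where m_i are the multiplicities
    of the distinct elements. Hypothetical helper for illustration.
    """
    counts = Counter(items)
    bits = math.log2(math.factorial(len(items)))
    for multiplicity in counts.values():
        bits -= math.log2(math.factorial(multiplicity))
    return bits


def graph_discount_bits(num_nodes, edges):
    """Return log2(n! / |Aut(G)|) for a small undirected simple graph.

    Automorphisms are counted by trying every node permutation, so this
    is only feasible for a handful of nodes. Hypothetical helper.
    """
    edge_set = {frozenset(edge) for edge in edges}
    automorphisms = 0
    for perm in permutations(range(num_nodes)):
        # Relabel every edge and check that the edge set is unchanged.
        relabeled = {frozenset((perm[u], perm[v])) for u, v in edges}
        if relabeled == edge_set:
            automorphisms += 1
    return math.log2(math.factorial(num_nodes) / automorphisms)


if __name__ == "__main__":
    # Multiset {a, a, b}: 3!/2! = 3 distinct orderings.
    print(multiset_discount_bits("aab"))
    # Path graph 0-1-2: 3!/2 = 3 distinct labelings.
    print(graph_discount_bits(3, [(0, 1), (1, 2)]))
    # Triangle: every relabeling gives the same graph, so nothing is saved.
    print(graph_discount_bits(3, [(0, 1), (1, 2), (0, 2)]))
```

Under these assumptions, highly symmetric objects (many repeated elements, or graphs with many automorphisms) offer a smaller saving, because fewer of their orderings are actually distinct.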
Peer Reviews
Decision: ICLR 2024 poster
It seems meaningful to reduce the compression cost by removing the order information in a data structure. The proposed shuffle coding obtains a discount in the lossless compression of such data structures, as illustrated by Equation 14.
My major concern is about the significance of the problem studied in this paper: considering the complexity, will the proposed method have wide or potential applications in practice? To me, it seems fairly intuitive to remove the order information so that the coding cost is reduced when compressing graph data. Is bits-back coding necessary in this scheme? These concerns may partially be attributed to my lack of expertise in the field of compressing graphs. In addition, Appendix C d
This paper presents a few key strengths which, in my view, are as follows: __Elegant unified framework:__ This paper provides a unified theoretical framework for compressing unordered objects, such as multisets and graphs. This approach is based on an elegant idea: since the order of the parts of an object does not matter, one can reduce the cost of communicating the object by getting a certain number of bits back, namely the bits corresponding to a particular ordering of the parts. This general
The paper's main weaknesses, in my view, revolve around the practical applicability of shuffle coding: __Large runtime complexity:__ As the authors note, applying shuffle coding to a graph requires solving a graph isomorphism problem, for which no polynomial-time algorithm is known. This can be a significant hurdle when coding larger graphs. The authors brought up this issue in the paper, and suggested that approximately solving the isomorphism problem is a promising way to scale the method. Ho
## Strength * The unordered set / graph compression problem is of good practical value. The proposed approach is a neat extension of bits-back coding. It is simple, novel and works well.
## Weakness * As the authors have discussed, the number of initial bits currently required is quite large. This hinders the practical application of the proposed approach to one-shot object coding, though it is still possible to apply the approach to a dataset to amortize the initial bits. An alternative to the bit-swap approach mentioned by the authors is correlation communication [Harsha 2010, The Communication Complexity of Correlation] [Li 2018, Strong Functional Representation Lemma and Applications to Co
Taxonomy
Topics: Computability, Logic, AI Algorithms · Cellular Automata and Applications
