Random Permutation Codes: Lossless Source Coding of Non-Sequential Data
Daniel Severo

TL;DR
This thesis introduces a formal framework for lossless compression of non-sequential data types, using Random Permutation Codes to achieve optimal rates by removing order-related redundancy.
Contribution
It formalizes non-sequential data as Combinatorial Random Variables and develops Random Permutation Codes for their efficient lossless compression.
Findings
Fully characterizes the rates of Combinatorial Random Variables (CRVs) in terms of the data and the equivalence relations.
Develops specialized Random Permutation Codes (RPCs) for multisets, graphs, and clusterings.
Provides new algorithms for compressing databases, social networks, and web data.
Abstract
This thesis deals with the problem of communicating and storing non-sequential data. We investigate this problem through the lens of lossless source coding, also sometimes referred to as lossless compression, from both an algorithmic and information-theoretic perspective. Lossless compression algorithms typically preserve the ordering in which data points are compressed. However, there are data types where order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications. Compressing with traditional algorithms is possible if we pick an order for the elements and communicate the corresponding ordered sequence. However, unless the order information is somehow removed during the encoding process, this procedure will be sub-optimal, because the order contains information and therefore more bits are used…
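To make the cost of order concrete, here is a minimal sketch. It is not taken from the thesis, and the function name `order_bits` is hypothetical. It assumes that every distinct ordering of the collection is equally likely (for example, i.i.d. elements), in which case an ordered sequence costs log2(n! / (m_1! ... m_k!)) more bits than the underlying multiset, where the m_i are the multiplicities of the distinct elements.

```python
import math
from collections import Counter


def order_bits(items):
    """Return log2 of the number of distinct orderings of `items`.

    Assumption (not from the source): all distinct orderings are equally
    likely, so this is the number of bits spent on order alone when the
    collection is coded as a sequence.
    """
    counts = Counter(items)
    n = len(items)
    # lgamma(k + 1) == ln(k!), which avoids computing huge factorials.
    ln_orderings = math.lgamma(n + 1) - sum(
        math.lgamma(m + 1) for m in counts.values()
    )
    return ln_orderings / math.log(2)


if __name__ == "__main__":
    # Three distinct elements have 3! = 6 orderings, i.e. log2(6) bits.
    print(round(order_bits(["x", "y", "z"]), 3))  # 2.585
    # Repeated elements reduce the number of distinct orderings.
    print(round(order_bits(["a", "a", "b"]), 3))  # 1.585
```

The saving grows quickly with the size of the collection, since log2(n!) grows on the order of n log2(n) when all elements are distinct.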
Taxonomy
Topics: Cooperative Communication and Network Coding · Wireless Communication Security Techniques · DNA and Biological Computing
