
TL;DR
This paper introduces a formal approach to unshuffling data format chunks with common length statistics, providing probabilistic bounds and algorithms to aid reverse engineering of complex data formats.
Contribution
It formalizes the unshuffling problem, derives probabilistic bounds, and presents algorithms, advancing data format reverse engineering techniques.
Findings
The unshuffling problem can be stated formally and admits probabilistic bounds.
Unshuffling is demonstrated empirically as a precursor to identifying conserved internal structures in chunked data formats.
Unshuffling is connected to the related class of synchronization problems.
Abstract
Data format reverse engineering commonly involves identifying conserved format motifs. However, this process typically requires establishing a common ordering for format elements across instances, particularly for formats using type-(length)-value tuples or "chunk" encoding. It is useful to unshuffle chunks with common length statistics as a precursor to identifying conserved internal structures. We formalize the unshuffling problem and subsequently derive probabilistic bounds and outline corresponding algorithms for it. We empirically demonstrate unshuffling and highlight connections with the related class of synchronization problems.
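To make the idea of unshuffling by length statistics concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes every instance holds the same number of chunks, takes the first instance as the reference ordering, and uses a brute-force search over permutations that is only practical for a handful of chunks. The function name `unshuffle` and the toy data are hypothetical.

```python
from itertools import permutations


def unshuffle(reference_lengths, chunks):
    """Reorder `chunks` so that the chunk placed in slot k has a length
    as close as possible to reference_lengths[k].

    Assumption: len(chunks) == len(reference_lengths). The cost is the
    sum of absolute length differences, minimized by exhaustive search.
    """
    n = len(chunks)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(
            abs(len(chunks[p[k]]) - reference_lengths[k]) for k in range(n)
        ),
    )
    return [chunks[i] for i in best]


# Toy instances: three chunk kinds with roughly stable lengths,
# appearing in a different order in each instance.
instances = [
    [b"A" * 4, b"B" * 16, b"C" * 64],
    [b"c" * 60, b"a" * 5, b"b" * 18],
    [b"b" * 15, b"c" * 66, b"a" * 4],
]

# Use the first instance's chunk lengths as the reference ordering.
reference = [len(c) for c in instances[0]]
aligned = [unshuffle(reference, inst) for inst in instances]

for row in aligned:
    print([len(c) for c in row])
```

Once the chunks share an ordering, each column of `aligned` holds chunks of the same presumed kind, which is the starting point for looking for conserved internal structure. When the length distributions of different chunk kinds overlap, a matching like this can be wrong, which is where probabilistic bounds become relevant.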
Taxonomy
Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Machine Learning and Algorithms
