Variations on the Problem of Identifying Spectrum-Preserving String Sets
Sankardeep Chakraborty, Roberto Grossi, Ren Kimura, Giulia Punzi, Kunihiko Sadakane, and Wiktor Zuba

TL;DR
This paper introduces necklace covers, a new method for compactly representing genomic $k$-mer spectra by extending path-based models to include cycles and branches, improving storage efficiency.
Contribution
It extends the spectrum-preserving string set framework from paths to necklace covers, combining cycles and branches, with a greedy algorithm that often produces near-optimal representations.
Findings
On real genomic datasets, minimum necklace covers produce smaller representations than Eulertigs.
They achieve compression comparable to Masked Superstrings.
They preserve the $k$-mer spectrum exactly.
Abstract
In computational genomics, many analyses rely on efficient storage and traversal of $k$-mers, motivating compact representations such as spectrum-preserving string sets (SPSS), which store strings whose $k$-mer spectrum matches that of the input. Existing approaches, including Unitigs, Eulertigs and Matchtigs, model this task as a path cover problem on the de Bruijn graph. We extend this framework from paths to branching structures by introducing necklace covers, which combine cycles and tree-like attachments (pendants). We present a greedy algorithm that constructs a necklace cover while guaranteeing, under certain conditions, optimality in the cumulative size of the final representation. Experiments on real genomic datasets indicate that the minimum necklace cover achieves smaller representations than Eulertigs and comparable compression to the Masked Superstrings approach, while…
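The defining property of an SPSS, that its $k$-mer spectrum equals that of the input, can be checked directly. The following is a minimal sketch of that check, not the paper's algorithm: the function names (`kmer_spectrum`, `is_spss`, `cumulative_length`) and the toy input are hypothetical, and reverse complements are ignored for simplicity.

```python
def kmer_spectrum(strings, k):
    """Return the set of all length-k substrings of the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def is_spss(candidate, reference, k):
    """True if both string sets have exactly the same k-mer spectrum."""
    return kmer_spectrum(candidate, k) == kmer_spectrum(reference, k)


def cumulative_length(strings):
    """Total number of characters stored across all strings."""
    return sum(len(s) for s in strings)


# Toy input (an assumption for illustration): four 3-mers stored one by one.
k = 3
kmers = ["ACG", "CGT", "GTA", "TAC"]

# A single string covering the same four 3-mers with overlaps.
compact = ["ACGTAC"]

# A shorter string that loses two of the 3-mers, so it is not an SPSS.
lossy = ["ACGT"]

print(is_spss(compact, kmers, k))   # same spectrum
print(is_spss(lossy, kmers, k))     # spectrum differs
print(cumulative_length(kmers), cumulative_length(compact))
```

In this toy input the four $k$-mers form a cycle in the de Bruijn graph (the suffix `AC` of `TAC` overlaps the prefix of `ACG`), so a path-based string has to repeat $k-1$ characters to close the loop. The sketch only verifies spectrum equality and measures cumulative length; it does not construct path or necklace covers.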
Taxonomy
Topics: Genome Rearrangement Algorithms · Genomics and Phylogenetic Studies · Algorithms and Data Compression
