Greedy Shortest Common Superstring Approximation in Compact Space
Jarno Alanko, Tuukka Norri

TL;DR
This paper introduces a space-efficient implementation of the greedy heuristic for the shortest common superstring problem, enabling practical use on large datasets like DNA fragments.
Contribution
It provides the first time- and space-efficient implementation of the classic greedy heuristic for the shortest common superstring problem.
Findings
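Runs in O(n log σ) time and O(n log σ) bits of space, where n is the total length of the input strings and σ is the alphabet size
After index construction, a practical implementation uses roughly 5 n log σ bits of space on a real dataset
Runs in reasonable time on a real dataset of DNA fragments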
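To give a feel for the scale of the 5 n log σ figure above, the following is a small back-of-the-envelope sketch. It assumes the logarithm is base 2 (so the result is in bits) and a DNA alphabet of σ = 4; the function name `estimated_bits` and the input size of 10^9 characters are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math


def estimated_bits(n: int, sigma: int, factor: float = 5.0) -> float:
    """Approximate space in bits as factor * n * log2(sigma).

    The factor of 5 is the rough constant quoted in the findings;
    base-2 logarithm is an assumption made here.
    """
    return factor * n * math.log2(sigma)


# Hypothetical input: one billion characters over the alphabet {A, C, G, T}.
bits = estimated_bits(n=1_000_000_000, sigma=4)
gigabytes = bits / 8 / 1e9
print(f"{bits:.0f} bits, about {gigabytes} GB")
```

Under these assumptions each input character costs about 10 bits, compared with the 2 bits per character needed just to store the DNA input itself.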
Abstract
Given a set of strings, the shortest common superstring problem is to find the shortest possible string that contains all the input strings. The problem is NP-hard, but a lot of work has gone into designing approximation algorithms for solving the problem. We present the first time and space efficient implementation of the classic greedy heuristic which merges strings in decreasing order of overlap length. Our implementation works in O(n log σ) time and bits of space, where n is the total length of the input strings in characters, and σ is the size of the alphabet. After index construction, a practical implementation of our algorithm uses roughly 5 n log σ bits of space and reasonable time for a real dataset that consists of DNA fragments.
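The greedy heuristic described in the abstract can be illustrated with a deliberately naive sketch. This is not the paper's compact-space implementation: it recomputes all pairwise overlaps on every round and is only meant to show the merge order on small inputs. The function names `overlap` and `greedy_superstring` are hypothetical, and ties between equal overlaps are broken arbitrarily by iteration order.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap."""
    # Drop duplicates and strings already contained in another input.
    unique = list(dict.fromkeys(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Naive search over all ordered pairs for the maximum overlap.
        for i, j in permutations(range(len(pool)), 2):
            k = overlap(pool[i], pool[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return pool[0] if pool else ""


fragments = ["AAC", "ACG", "CGT"]
superstring = greedy_superstring(fragments)
print(superstring)
```

For the three fragments above, the sketch first merges "AAC" and "ACG" (overlap 2) and then appends "CGT" (overlap 2), giving a superstring of length 5 instead of the 9 characters of plain concatenation.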
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Network Packet Processing and Optimization
