Enumeration of sequences with large alphabets
M. Oguzhan Kulekci

TL;DR
This paper develops efficient enumerative coding schemes for sequences over large alphabets, introducing a new method that outperforms basic schemes, especially for DNA sequence applications.
Contribution
The paper proposes a novel enumeration-based coding method for large alphabet sequences, improving efficiency over naive representations and extending to DNA sequences.
Findings
The new coding scheme reduces the number of bits needed by approximately $(\sigma-1)\log(\sigma-1)$ compared to the naive representation, where $\sigma$ is the alphabet size.
Experimental results show the new method outperforms basic schemes for large alphabets.
The approach is effective for DNA sequence encoding, demonstrating practical utility.
Abstract
This study focuses on efficient schemes for enumerative coding of $\sigma$-ary sequences, mainly borrowing ideas from Öktem & Astola's \cite{Oktem99} hierarchical enumerative coding and Schalkwijk's \cite{Schalkwijk72} asymptotically optimal combinatorial code on binary sequences. By observing that the number of distinct $\sigma$-dimensional vectors with a given inner sum, where the value in each dimension lies in a bounded range, can be counted exactly, we propose representing such a vector via enumeration and present the necessary algorithms to perform this task. We prove that this representation requires approximately $(\sigma-1)\log(\sigma-1)$ fewer bits than the naive representation when the inner sum is relatively large, and examine the results for varying alphabet sizes experimentally.…
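To make the idea of representing a vector by its enumeration index concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes each entry is a non-negative integer between 0 and the inner sum, counts such vectors with the standard stars-and-bars formula, and ranks them in lexicographic order. The function names (`count_vectors`, `rank`, `unrank`, `bits_needed`) and the form of the naive cost (one fixed-width field for each of the first $\sigma-1$ entries) are assumptions made for illustration.

```python
from math import comb


def count_vectors(sigma, n):
    """Count length-sigma vectors of non-negative integers that sum to n."""
    if n < 0:
        return 0
    if sigma == 0:
        return 1 if n == 0 else 0
    return comb(n + sigma - 1, sigma - 1)  # stars and bars


def rank(vec):
    """Lexicographic index of vec among vectors of the same length and sum."""
    sigma, remaining, index = len(vec), sum(vec), 0
    for pos, value in enumerate(vec[:-1]):
        dims_left = sigma - pos - 1
        # Skip every vector whose entry at this position is smaller.
        for smaller in range(value):
            index += count_vectors(dims_left, remaining - smaller)
        remaining -= value
    return index


def unrank(index, sigma, n):
    """Inverse of rank: rebuild the vector from its index, length and sum."""
    if not 0 <= index < count_vectors(sigma, n):
        raise ValueError("index out of range")
    vec, remaining = [], n
    for pos in range(sigma - 1):
        dims_left = sigma - pos - 1
        value = 0
        while True:
            block = count_vectors(dims_left, remaining - value)
            if index < block:
                break
            index -= block
            value += 1
        vec.append(value)
        remaining -= value
    vec.append(remaining)  # the last entry is forced by the sum
    return vec


def bits_needed(sigma, n):
    """Return (enumerative bits, assumed naive bits) for given sigma and n."""
    enumerative = (count_vectors(sigma, n) - 1).bit_length()
    naive = (sigma - 1) * n.bit_length()  # n.bit_length() == ceil(log2(n + 1))
    return enumerative, naive


if __name__ == "__main__":
    v = [1, 1, 0]
    i = rank(v)
    print(i, unrank(i, len(v), sum(v)))
    print(bits_needed(4, 100))
```

The index alone identifies the vector once the length and the inner sum are known, so only about the base-2 logarithm of the count is needed instead of one fixed-width field per dimension.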
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and automata theory · Fractal and DNA sequence analysis
