The Capacity of Associated Subsequence Retrieval
Behrooz Tahmasebi, Mohammad Ali Maddah-Ali, Seyed Abolfazl Motahari

TL;DR
This paper introduces an information-theoretic framework for associated subsequence retrieval in genomic data, establishing the capacity and thresholds for accurately identifying relevant subsequences linked to observable traits.
Contribution
It formulates the associated subsequence retrieval problem, derives its capacity, and provides achievable schemes and converses for zero-error and epsilon-error scenarios.
Findings
Threshold effect in error probability versus rate curve.
Capacity characterized for zero-error and epsilon-error cases.
Achievable schemes and converses match, establishing optimality.
Abstract
The objective of a genome-wide association study (GWAS) is to associate subsequences of individuals' genomes with observable characteristics called phenotypes (e.g., high blood pressure). Motivated by the GWAS problem, in this paper we introduce the information-theoretic problem of associated subsequence retrieval, where a dataset of (possibly high-dimensional) sequences of a common fixed length, together with their corresponding observable (binary) characteristics, is given. The sequences are chosen independently and uniformly at random over a finite alphabet. The observable (binary) characteristic depends only on a specific unknown fixed-length subsequence of the sequences, called the associated subsequence. For each sequence, if its associated subsequence belongs to a universal finite set, then it is more likely to display the observable…
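The data model described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the authors' method: the function name `generate_dataset`, the parameters `positions`, `trigger_set`, `p_in`, and `p_out`, the DNA alphabet, and all numeric values are assumptions introduced here for illustration. It draws sequences uniformly at random, reads off the symbols at a hidden set of positions, and makes the binary characteristic more likely to be one when that subsequence falls in a fixed set.

```python
import random


def generate_dataset(num_sequences, seq_len, positions, trigger_set,
                     p_in, p_out, alphabet="ACGT", seed=0):
    """Simulate sequences and binary characteristics (illustrative only).

    positions   -- hypothetical hidden indices forming the associated subsequence
    trigger_set -- hypothetical fixed set of subsequence values that raise
                   the probability of the characteristic being 1
    p_in, p_out -- assumed probabilities of label 1 when the subsequence is
                   inside / outside trigger_set (not taken from the paper)
    """
    rng = random.Random(seed)
    sequences, labels = [], []
    for _ in range(num_sequences):
        # Each symbol is drawn independently and uniformly from the alphabet.
        seq = "".join(rng.choice(alphabet) for _ in range(seq_len))
        # Extract the associated subsequence at the hidden positions.
        sub = "".join(seq[i] for i in positions)
        # The label depends only on the associated subsequence.
        p_one = p_in if sub in trigger_set else p_out
        labels.append(1 if rng.random() < p_one else 0)
        sequences.append(seq)
    return sequences, labels


if __name__ == "__main__":
    seqs, labs = generate_dataset(
        num_sequences=10, seq_len=12, positions=[2, 7],
        trigger_set={"AA", "CG"}, p_in=0.9, p_out=0.1,
    )
    for s, l in zip(seqs, labs):
        print(s, l)
```

In this sketch the retrieval task would be to recover `positions` from the returned sequences and labels alone. How many sequences that requires is the question the paper's capacity result addresses.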
