Context binning, model clustering and adaptivity for data compression of genetic data
Jarek Duda

TL;DR
This paper introduces automated methods for optimizing statistical models, including context binning and model clustering, to improve data compression efficiency for large genetic datasets.
Contribution
It presents novel automated techniques for context binning and model clustering that enhance genetic data compression by capturing more information with fewer states.
Findings
Context binning automatically optimizes context reduction.
Model clustering applies k-means in the space of statistical models, yielding a few centroid models that can be chosen e.g. separately for each read.
Adaptivity techniques for handling data non-stationarity are briefly discussed.
Abstract
Rapid growth of genetic databases means that improvements in their data compression bring huge savings, which requires better inexpensive statistical models. This article proposes automated optimizations, e.g. of Markov-like models, especially context binning and model clustering. While it is popular to simply remove the low bits of the context, the proposed context binning automatically optimizes such a reduction as a table: state=bin[context], with the state determining the probability distribution. This extracts nearly all useful information, even from very large contexts, into a relatively small number of states. The second proposed approach, model clustering, uses k-means clustering in the space of general statistical models, allowing a few models (the cluster centroids) to be optimized and then chosen, e.g. separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed.
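To make the state=bin[context] idea concrete, here is a minimal Python sketch. It is not the paper's optimization procedure: it only shows that a bin table is a lookup from contexts to states, and that truncating the context is one special case of such a table. The function names, the toy sequence, the order-2 context, and the Laplace-smoothed cost estimate are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet

def context_counts(seq, order):
    """Count next-symbol occurrences for every context of `order` previous symbols."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts

def merge_by_bin(counts, bin_table):
    """Apply state = bin_table[context]: contexts sharing a state pool their counts."""
    states = defaultdict(Counter)
    for ctx, c in counts.items():
        states[bin_table[ctx]].update(c)
    return states

def code_length_bits(states):
    """Estimated code length in bits, using Laplace-smoothed per-state distributions."""
    total = 0.0
    for c in states.values():
        n = sum(c.values()) + len(ALPHABET)
        for k in c.values():
            total -= k * math.log2((k + 1) / n)
    return total

# Toy data (hypothetical, not from the paper).
seq = "ACGTACGTTTGACCAGTACGATCGATTACA" * 20
counts = context_counts(seq, order=2)

# Baseline table: keep only the most recent symbol, i.e. drop part of the context.
truncate = {ctx: ctx[-1] for ctx in counts}
# Finest table: every context is its own state.
identity = {ctx: ctx for ctx in counts}

by_truncation = merge_by_bin(counts, truncate)
by_identity = merge_by_bin(counts, identity)

print(len(by_truncation), "states:", round(code_length_bits(by_truncation), 1), "bits")
print(len(by_identity), "states:", round(code_length_bits(by_identity), 1), "bits")
```

An optimized bin table would sit between these two extremes: it would search for a grouping of contexts that keeps the estimated code length close to the finest table while using far fewer states. That search is the part the paper automates and is deliberately left out of this sketch.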
Taxonomy
Topics: Algorithms and Data Compression · Gene expression and cancer classification · Gaussian Processes and Bayesian Inference
