Context binning, model clustering and adaptivity for data compression of genetic data

Jarek Duda

arXiv:2201.05028·cs.IT·May 4, 2022·1 cites

Open Access

TL;DR

This paper introduces automated methods for optimizing statistical models, including context binning and model clustering, to improve data compression efficiency for large genetic datasets.

Contribution

It presents novel automated techniques for context binning and model clustering that enhance genetic data compression by capturing more information with fewer states.
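As a rough illustration of the context binning idea (a table `bin[context]` mapping many contexts to a few states, each with its own probability distribution), here is a minimal sketch. It is not the paper's procedure: the Lloyd-style alternation, the additive smoothing `eps`, the random initialization, and all function names are assumptions made for this example.

```python
import math
import random

def context_counts(seq, order, alphabet="ACGT"):
    """Count next-symbol occurrences for each context of `order` previous symbols."""
    idx = {ch: i for i, ch in enumerate(alphabet)}
    m = len(alphabet)
    counts = [[0] * m for _ in range(m ** order)]
    ctx = 0
    for t, ch in enumerate(seq):
        s = idx[ch]
        if t >= order:                       # context is complete only after `order` symbols
            counts[ctx][s] += 1
        ctx = (ctx * m + s) % (m ** order)   # slide the context window
    return counts

def bin_distributions(counts, bins, n_bins, eps=0.5):
    """Pool the counts of all contexts sharing a bin; eps is assumed additive smoothing."""
    m = len(counts[0])
    pooled = [[eps] * m for _ in range(n_bins)]
    for row, b in zip(counts, bins):
        for s, c in enumerate(row):
            pooled[b][s] += c
    return [[v / sum(row) for v in row] for row in pooled]

def cost_bits(row, dist):
    """Code length in bits of one context's counts under a bin's distribution."""
    return -sum(c * math.log2(p) for c, p in zip(row, dist))

def bin_contexts(counts, n_bins, n_iter=20, seed=0):
    """Alternate between re-estimating bin distributions and reassigning each
    context to the bin that encodes its counts most cheaply."""
    rng = random.Random(seed)
    bins = [rng.randrange(n_bins) for _ in counts]
    for _ in range(n_iter):
        dists = bin_distributions(counts, bins, n_bins)
        new_bins = [min(range(n_bins), key=lambda b: cost_bits(row, dists[b]))
                    for row in counts]
        if new_bins == bins:
            break
        bins = new_bins
    return bins, bin_distributions(counts, bins, n_bins)

# Hypothetical usage: 16 order-2 contexts reduced to at most 4 states.
counts = context_counts("ACGT" * 50, order=2)
bins, dists = bin_contexts(counts, n_bins=4)
```

During coding, the state for a symbol would then be looked up as `bins[ctx]` and the symbol coded with `dists[bins[ctx]]`. The iteration can stop in a local optimum, so a practical version might try several seeds.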

Findings

01

Context binning automatically optimizes context reduction.

02

Model clustering applies k-means in the space of statistical models, yielding a few centroid models that can be chosen, e.g., separately for each read.

03

Adaptivity techniques for handling data non-stationarity are briefly discussed.
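The summary does not say which adaptivity techniques the paper uses. As a generic stand-in, the sketch below uses exponential forgetting of symbol probabilities, a common way to track non-stationary statistics. The update rule, the `rate` parameter, and the function name are assumptions for illustration only.

```python
import math

def adaptive_code_length(seq, rate=0.05, alphabet="ACGT"):
    """Code `seq` with probabilities that drift toward recently seen symbols.

    Update after each symbol x: p[a] <- (1 - rate) * p[a] + rate * [a == x],
    which keeps the probabilities normalized.
    """
    probs = {a: 1.0 / len(alphabet) for a in alphabet}
    bits = 0.0
    for ch in seq:
        bits += -math.log2(probs[ch])        # cost of the symbol before updating
        for a in alphabet:
            probs[a] = (1.0 - rate) * probs[a] + (rate if a == ch else 0.0)
    return bits, probs

# Hypothetical usage: the statistics change abruptly halfway through.
bits, final_probs = adaptive_code_length("A" * 100 + "C" * 100)
```

A larger `rate` reacts faster to changes but gives noisier estimates on stationary stretches, so it would normally be tuned per data type.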

Abstract

Rapid growth of genetic databases means that improvements in their data compression bring huge savings, which requires better inexpensive statistical models. This article proposes automated optimizations, e.g. of Markov-like models, especially context binning and model clustering. While it is popular to simply remove the low bits of the context, the proposed context binning automatically optimizes such reduction as a table, state=bin[context], with the state determining the probability distribution. This extracts nearly all useful information, even from very large contexts, into a relatively small number of states. The second proposed approach, model clustering, uses k-means clustering in the space of general statistical models, allowing a few models (the cluster centroids) to be optimized and then chosen, e.g., separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed.
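The model clustering idea from the abstract (a few centroid models, one chosen per read) can be sketched as a k-means-like loop in which "distance" is the code length of a read under a centroid model. This is an illustrative reading, not the paper's algorithm: the order-1 Markov model, the smoothing `eps`, the round-robin initialization, and all names are assumptions.

```python
import math

ALPHABET = "ACGT"

def read_counts(read, order=1):
    """Flattened (context, symbol) counts of one read, standing in for its model."""
    m = len(ALPHABET)
    idx = {ch: i for i, ch in enumerate(ALPHABET)}
    counts = [0] * (m ** (order + 1))
    for t in range(order, len(read)):
        key = 0
        for ch in read[t - order:t + 1]:     # context symbols followed by the new symbol
            key = key * m + idx[ch]
        counts[key] += 1
    return counts

def centroid(rows, order=1, eps=0.5):
    """Centroid model: P(symbol | context) from pooled, smoothed counts."""
    m = len(ALPHABET)
    pooled = [eps] * (m ** (order + 1))
    for row in rows:
        for i, c in enumerate(row):
            pooled[i] += c
    probs = []
    for start in range(0, len(pooled), m):   # each block of m entries shares one context
        total = sum(pooled[start:start + m])
        probs.extend(v / total for v in pooled[start:start + m])
    return probs

def read_cost_bits(counts, probs):
    """Code length in bits of a read under a centroid model."""
    return -sum(c * math.log2(p) for c, p in zip(counts, probs))

def cluster_models(reads, k, order=1, n_iter=20):
    """Alternate between fitting centroids and assigning each read to its cheapest one."""
    rows = [read_counts(r, order) for r in reads]
    labels = [i % k for i in range(len(rows))]   # assumed round-robin initialization

    def fit(current):
        return [centroid([r for r, l in zip(rows, current) if l == j], order)
                for j in range(k)]

    for _ in range(n_iter):
        cents = fit(labels)
        new_labels = [min(range(k), key=lambda j: read_cost_bits(r, cents[j]))
                      for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, fit(labels)

# Hypothetical usage: two A-rich reads and two CG-alternating reads, two models.
reads = ["AAAAAAAAAA", "CGCGCGCGCG", "AAAAAAAAAC", "GCGCGCGCGC"]
labels, models = cluster_models(reads, k=2)
```

In a compressor, each read would store its label (about log2(k) bits) and then be coded with the corresponding centroid model, so `k` trades side-information cost against model fit.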

Taxonomy

Topics: Algorithms and Data Compression · Gene expression and cancer classification · Gaussian Processes and Bayesian Inference