Importance Sampling of Word Patterns in DNA and Protein Sequences
Hock Peng Chan, Nancy R. Zhang, and Louis H.Y. Chen

TL;DR
This paper introduces efficient importance sampling algorithms for estimating small probabilities of specific word patterns in DNA and protein sequences, improving upon naive Monte Carlo methods especially for rare events.
Contribution
It presents importance sampling techniques that insert the desired word patterns into randomly generated sequences in a controlled way, so that rare pattern occurrences are sampled often enough to estimate their small probabilities accurately and efficiently.
Findings
Effective estimation of small probabilities for biological motifs.
Improved efficiency over naive Monte Carlo methods.
Application to biologically relevant patterns like palindromes and motifs.
Abstract
Monte Carlo methods can provide accurate p-value estimates of word-counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices, and co-occurrences of pairs of motifs.
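The contrast between naive Monte Carlo and insertion-based importance sampling can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the simplest setting of a single fixed word and an i.i.d. letter background, and the function names (count_occurrences, naive_mc, importance_sampling) are hypothetical. The word is inserted at a uniformly chosen position of a random sequence, and each sample is weighted by the likelihood ratio (n - m + 1) * P(word) / N, where N is the number of (possibly overlapping) occurrences of the word in the resulting sequence.

```python
import random


def count_occurrences(seq, word):
    """Count occurrences of word in seq, overlapping ones included."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def naive_mc(word, n, probs, trials, rng):
    """Direct Monte Carlo estimate of P(word occurs at least once)."""
    letters = list(probs)
    weights = [probs[a] for a in letters]
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choices(letters, weights=weights, k=n))
        if word in seq:
            hits += 1
    return hits / trials


def importance_sampling(word, n, probs, trials, rng):
    """Importance sampling estimate of the same probability.

    Assumption (illustrative only): letters are i.i.d. with
    distribution probs, and the pattern is one fixed word.
    """
    letters = list(probs)
    weights = [probs[a] for a in letters]
    m = len(word)
    positions = n - m + 1          # possible start positions
    p_word = 1.0                   # null probability of the word itself
    for a in word:
        p_word *= probs[a]
    total = 0.0
    for _ in range(trials):
        seq = rng.choices(letters, weights=weights, k=n)
        start = rng.randrange(positions)
        seq[start:start + m] = list(word)   # controlled insertion
        n_occ = count_occurrences("".join(seq), word)  # always >= 1
        total += positions * p_word / n_occ  # likelihood ratio P/Q
    return total / trials


if __name__ == "__main__":
    rng = random.Random(1)
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    word = "ACGTACGTAC"
    print("naive:", naive_mc(word, 100, uniform, 10_000, rng))
    print("importance sampling:", importance_sampling(word, 100, uniform, 10_000, rng))
```

With a 10-letter word in a sequence of length 100, the naive estimator will usually return exactly 0 at this number of trials, because the event is so rare that it is almost never generated. Every importance sample contains the word by construction, so each trial contributes a positive, bounded weight.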
Taxonomy
Topics: RNA and protein synthesis mechanisms · Genomics and Chromatin Dynamics · Algorithms and Data Compression
