Towards a theoretical understanding of false positives in DNA motif finding
Amin Zia, Alan M. Moses

TL;DR
This paper provides a theoretical analysis showing that false positives in DNA motif finding can arise from dataset size alone, irrespective of the algorithm used. Their strength depends more on the number of sequences than on sequence length, and the paper offers practical guidelines to reduce them.
Contribution
It derives a theoretical model linking false positives to dataset size, revealing their dependence on the number of sequences and providing practical rules to improve motif-finding accuracy.
Findings
False positives can arise from large dataset size, irrespective of the algorithm used, and their strength depends more on the number of sequences than on sequence length.
Reducing sequence length or increasing the number of sequences can decrease false positives.
Adding more sequences beyond a certain point does not significantly reduce false positives.
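As a rough illustration of the first two findings, the sketch below estimates how many words of a fixed width would be expected to occur in every sequence of a random dataset purely by chance. It is a toy calculation, not the paper's derivation: it assumes a uniform, independent background (each base with probability 1/4), exact word matches, and independence between positions, and the function names are hypothetical. It does not reproduce the plateau described in the third finding.

```python
def p_word_in_sequence(width, length):
    """Approximate chance that one fixed word occurs at least once
    in a random sequence (assumes independent positions)."""
    positions = max(length - width + 1, 0)   # number of possible start sites
    per_site = 0.25 ** width                 # uniform background assumption
    return 1.0 - (1.0 - per_site) ** positions


def expected_chance_words(width, length, n_sequences):
    """Expected number of distinct words of this width that appear
    in every one of n_sequences random sequences by chance."""
    n_words = 4 ** width
    return n_words * p_word_in_sequence(width, length) ** n_sequences


if __name__ == "__main__":
    # Columns: number of sequences, sequence length, expected chance words
    for n in (5, 10, 20):
        for length in (200, 500, 1000):
            print(n, length, expected_chance_words(6, length, n))
```

Under these assumptions the expected count grows with sequence length and shrinks as sequences are added, which matches the direction of the findings above.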
Abstract
Detection of false-positive motifs is one of the main causes of low performance in motif finding methods. It is generally assumed that false-positives are mostly due to algorithmic weakness of motif-finders. Here, however, we derive the theoretical dependence of false positives on dataset size and find that false positives can arise as a result of large dataset size, irrespective of the algorithm used. Interestingly, the false-positive strength depends more on the number of sequences in the dataset than it does on the sequence length. As expected, false-positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. The dependence on number of sequences, however, diminishes and reaches a plateau after which adding more sequences to the dataset does not reduce the false-positive rate significantly. Based on the theoretical results presented here, we…
Taxonomy
Topics: Genomics and Chromatin Dynamics · Genomics and Phylogenetic Studies · Algorithms and Data Compression
