Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description
Marc Santolini, Thierry Mora, Vincent Hakim

TL;DR
This paper demonstrates that incorporating pairwise correlations into models of transcription factor binding sites improves their accuracy over traditional independent models, revealing significant interdependence between DNA bases in vivo.
Contribution
It introduces a pairwise interaction model based on maximum entropy principles that captures nucleotide correlations in TFBSs, outperforming PWM-based approaches.
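As a sketch of the two model classes being contrasted, an independent (PWM-type) model and a pairwise maximum-entropy model over a site $s = (s_1, \dots, s_L)$ of length $L$ are commonly written in the form below. The symbols $p_i$, $h_i$, $J_{ij}$, and $Z$ are notation assumed here for illustration; the paper's own conventions (signs, gauge choice) may differ.

```latex
P_{\mathrm{PWM}}(s) = \prod_{i=1}^{L} p_i(s_i),
\qquad
P_{\mathrm{pair}}(s) = \frac{1}{Z}
  \exp\!\Big( \sum_{i=1}^{L} h_i(s_i) + \sum_{i<j} J_{ij}(s_i, s_j) \Big)
```

Setting every $J_{ij}$ to zero recovers the independent model, with $p_i(a) \propto \exp(h_i(a))$, so the pairwise model strictly extends the PWM description.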
Findings
Independent models do not reproduce observed TFBS statistics.
Pairwise interaction models improve TFBS prediction accuracy.
Most significant interactions are between consecutive base pairs.
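A minimal sketch of how one might check whether an independent model can account for a set of aligned sites: compute per-position base frequencies (the PWM-style description) and the mutual information between each pair of positions. Mutual information is used here as one convenient measure of pairwise dependence and is an assumption of this sketch, not necessarily the statistic used in the paper. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def site_frequencies(sites):
    """Per-position base frequencies (the independent, PWM-style description)."""
    n = len(sites)
    length = len(sites[0])
    return [
        {base: count / n for base, count in Counter(s[i] for s in sites).items()}
        for i in range(length)
    ]


def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of aligned sites."""
    n = len(sites)
    fi = Counter(s[i] for s in sites)
    fj = Counter(s[j] for s in sites)
    fij = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), count in fij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi


# Toy aligned sites (hypothetical data): positions 0 and 1 always co-vary,
# while position 2 varies independently of both.
toy_sites = ["AAC", "AAG", "TTC", "TTG"]

freqs = site_frequencies(toy_sites)
pair_mi = {
    (i, j): mutual_information(toy_sites, i, j)
    for i, j in combinations(range(len(toy_sites[0])), 2)
}
```

In this toy set, the independent model assigns probability 0.5 × 0.5 = 0.25 to a site starting with "AT", although no such site is observed; the nonzero mutual information between the two adjacent positions 0 and 1 flags exactly the kind of dependence a pairwise model can capture.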
Abstract
The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to transcription factor (TF) binding, despite mounting evidence of interdependence between base-pair positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding in vivo. Here, we use available fly and mouse ChIP-seq data and show that the independent model generally does not reproduce the observed statistics of TFBSs, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the…
