# How negative sampling shapes the performance of transcription factor binding site prediction models

**Authors:** Natan Tourne, Gaetan De Waele, Vanessa Vermeirssen, Willem Waegeman

PMC · DOI: 10.1093/bioinformatics/btag048 · Bioinformatics · 2026-01-27

## TL;DR

This study shows how different ways of selecting negative examples affect how well models can predict where transcription factors bind to DNA.

## Contribution

The study introduces a systematic evaluation of negative sampling techniques for transcription factor binding site prediction using high-quality test datasets.
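The abstract defines true negatives in the test datasets as positions that are accessible (ATAC-seq) but not bound by the TF (ChIP-seq). A minimal sketch of that labeling idea follows. The function names, the `(chrom, start, end)` tuple layout, and the any-overlap criterion are illustrative assumptions; the summary does not state the exact rules the authors used.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def label_accessible_regions(atac_peaks, chip_peaks):
    """Label each accessible region 1 (bound) or 0 (accessible but unbound).

    Both inputs are lists of (chrom, start, end) tuples with half-open
    coordinates. ASSUMPTION: any overlap with a ChIP-seq peak counts as bound.
    """
    labeled = []
    for chrom, start, end in atac_peaks:
        bound = any(
            c == chrom and overlaps(start, end, s, e)
            for c, s, e in chip_peaks
        )
        labeled.append((chrom, start, end, 1 if bound else 0))
    return labeled


# Hypothetical toy coordinates, not data from the paper.
atac = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
chip = [("chr1", 150, 250)]
labels = label_accessible_regions(atac, chip)
```

The nested scan is quadratic in the number of peaks; for genome-scale peak sets an interval tree or a sorted sweep would be the usual replacement.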

## Key findings

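- Genomic sampling of negatives based on similarity to the positives performed by far the best among the tested techniques, although it still fell short of baseline models trained on high-quality datasets.
- Dinucleotide-shuffled negatives led to poor model performance, even though dinucleotide shuffling is common practice in the field.
- Metrics calculated on training datasets generally give inflated performance scores.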
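As a rough illustration of similarity-based genomic sampling, the sketch below pairs each positive sequence with an unused genomic candidate of similar GC content. GC content as the similarity measure, the `tolerance` parameter, and all function names are assumptions made for illustration; this summary does not specify which similarity criterion the authors used.

```python
import random


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sample_gc_matched_negatives(positives, candidates, tolerance=0.05, seed=0):
    """For each positive, draw one unused candidate with a similar GC fraction.

    ASSUMPTION: similarity means |GC(positive) - GC(candidate)| <= tolerance.
    Positives without a matching candidate are skipped.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)  # randomize which matching candidate is picked
    negatives = []
    for pos in positives:
        target = gc_fraction(pos)
        for i, cand in enumerate(pool):
            if abs(gc_fraction(cand) - target) <= tolerance:
                negatives.append(pool.pop(i))  # sample without replacement
                break
    return negatives


# Hypothetical toy sequences, not data from the paper.
positives = ["GGCC", "ATAT"]
candidates = ["CCGG", "TTAA", "ACGT", "GCGC"]
negatives = sample_gc_matched_negatives(positives, candidates)
```

In practice the candidate pool would also need to exclude regions overlapping known binding sites; that filter is omitted here.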

## Abstract

Transcription factors (TFs) are key players in gene regulation and development, where they activate and repress gene expression through DNA binding. Predicting transcription factor binding sites (TFBSs) has long been an active area of research, with many deep learning methods developed to tackle this problem. These models are often trained on TF ChIP-seq data, which is generally seen as only providing positive samples. The choice of datasets and negative sampling techniques is a critical yet often overlooked aspect of this work.

In this study, we investigate the impact of different negative sampling techniques on TFBS prediction performance. We create high-quality test datasets based on ChIP-seq and ATAC-seq data, where true negatives can be identified as positions that are accessible but not bound by the TF in question. To simulate cases where matching ATAC-seq data is not available, we then train models using various negative sampling techniques: genomic sampling, shuffling, dinucleotide shuffling, neighborhood sampling, and cell-line-specific sampling. Our results show that metrics calculated on training datasets generally give inflated performance scores. Of the tested techniques, genomic sampling of negatives based on similarity to the positives performed by far the best, although it still did not reach the performance of baseline models trained on high-quality datasets. Models trained on dinucleotide-shuffled negatives performed poorly, even though dinucleotide shuffling is common practice in the field. Our findings highlight the importance of carefully selecting negative sampling techniques for TFBS prediction, as they can significantly impact model performance and the interpretation of results.
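To make the two shuffling-based techniques concrete, the sketch below generates a synthetic negative from a positive sequence in two ways: a plain shuffle, which preserves only base composition, and a dinucleotide shuffle, which also preserves the counts of all adjacent base pairs. The dinucleotide version uses a common Eulerian-path approach (keep each base's last outgoing transition fixed, permute the rest); it is not necessarily the implementation used in the paper, and the function names are illustrative.

```python
import random
from collections import Counter, defaultdict


def mononucleotide_shuffle(seq, rng):
    """Permute the bases at random; only single-base counts are preserved."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def dinucleotide_shuffle(seq, rng):
    """Shuffle a sequence while preserving all dinucleotide counts.

    Each base keeps a list of the bases that follow it in the original
    sequence. All but the last follower of each base are permuted; keeping
    the last one fixed guarantees the walk below consumes every transition.
    """
    if len(seq) < 3:
        return seq
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    for followers in successors.values():
        head = followers[:-1]
        rng.shuffle(head)
        followers[:-1] = head
    out = [seq[0]]
    used = Counter()  # how many followers of each base are consumed so far
    for _ in range(len(seq) - 1):
        current = out[-1]
        out.append(successors[current][used[current]])
        used[current] += 1
    return "".join(out)


# Hypothetical toy sequence, not data from the paper.
rng = random.Random(1)
positive = "ACGTTGCAACGGTTAACCGGATAT"
negative_mono = mononucleotide_shuffle(positive, rng)
negative_di = dinucleotide_shuffle(positive, rng)
```

This procedure does not sample uniformly from all dinucleotide-preserving permutations, but every output has the same length, first base, last base, and dinucleotide counts as the input.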

The code used in this study is available at https://github.com/NatanTourne/TFBS-negatives (DOI: 10.5281/zenodo.18007567).

## Full-text entities

- **Chemicals:** dinucleotide (MESH:D015226)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12910371/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12910371/full.md

## References

44 references — full list in the complete paper: https://tomesphere.com/paper/PMC12910371/full.md

---
Source: https://tomesphere.com/paper/PMC12910371