# Beyond blacklists: a critical assessment of exclusion set generation strategies and alternative approaches

**Authors:** Brydon P G Wall, Jonathan D Ogata, My Nguyen, Amy L Olex, Konstantinos V Floros, Anthony C Faber, Joseph L McClay, J Chuck Harrell, Mikhail G Dozmorov

PMC · DOI: 10.1093/bioinformatics/btag110 · Bioinformatics · 2026-03-13

## TL;DR

This paper evaluates methods to reduce alignment artifacts in genomic data, comparing blacklist strategies with sponge sequences and improved genome assemblies.

## Contribution

The study evaluates sponge sequences and the T2T-CHM13 assembly as alternatives to traditional blacklists for reducing alignment artifacts, and documents the reproducibility limits of pre-generated exclusion sets.

## Key findings

- Pre-generated exclusion sets are hard to reproduce because they are sensitive to input data, aligner choice, and read length.
- Using sponge sequences in alignment reduced ChIP-seq signal correlation as effectively as blacklists while preserving biological signal.
- Sponge-based alignment had minimal impact on RNA-seq gene counts, showing broader utility.

## Abstract

Short-read sequencing data can be affected by alignment artifacts in certain genomic regions. Removing reads that overlap these exclusion regions, previously known as Blacklists, may help improve biological signal. Alternatively, “sponge” or decoy sequences have been proposed to reduce alignment artifacts.
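
The read-removal step described above can be sketched in a few lines. This is a minimal illustration, not the Blacklist software: it assumes reads and exclusion regions are given as half-open, BED-style `(chrom, start, end)` tuples, and the function names `build_index`, `overlaps`, and `filter_reads` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group half-open (chrom, start, end) regions by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching: extend the previous merged region.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open read [start, end) overlaps any excluded region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def filter_reads(reads, exclusion_regions):
    """Keep only reads that do not overlap the exclusion set."""
    index = build_index(exclusion_regions)
    return [r for r in reads if not overlaps(index, *r)]


# Illustrative coordinates only; these are not regions from the paper.
exclusion = [("chr1", 100, 200), ("chr1", 150, 300)]
reads = [
    ("chr1", 50, 100),
    ("chr1", 90, 110),
    ("chr1", 299, 350),
    ("chr1", 300, 350),
    ("chr2", 100, 200),
]
kept = filter_reads(reads, exclusion)
```

Because the exclusion set is a fixed list of coordinates, the result depends entirely on how that list was generated, which is the reproducibility concern the abstract raises.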

We examined the widely used Blacklist software and found that pre-generated exclusion sets were difficult to reproduce due to sensitivity to input data, aligner choice, and read length. We further explored the use of “sponge” sequences (unassembled genomic regions such as satellite DNA, ribosomal DNA, and mitochondrial DNA) as an alternative approach. We additionally investigated the effect of the T2T-CHM13 genome assembly on improving biological signals. Aligning reads to a genome that includes sponge sequences reduced signal correlation in ChIP-seq data comparably to Blacklist-derived exclusion sets while preserving biological signal. Sponge-based alignment also had minimal impact on RNA-seq gene counts, suggesting broader applicability beyond chromatin profiling. These results highlight the limitations of fixed exclusion sets. We recommend using the T2T-CHM13 assembly or, for the hg38 genome assembly, “sponge” sequences as an alignment-guided strategy for reducing artifacts and improving functional genomics analyses.
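
One way to picture the sponge strategy is as a reference that carries extra decoy contigs, with alignments to those contigs discarded afterwards. The sketch below is an assumption about how such a workflow could look, not the authors' pipeline: the `sponge_` prefix, the function names, and the toy sequences are all hypothetical, and the aligner run between the two steps is omitted.

```python
SPONGE_PREFIX = "sponge_"  # hypothetical naming convention for decoy contigs


def tag_sponge_fasta(fasta_text, prefix=SPONGE_PREFIX):
    """Prefix every FASTA header so sponge contigs are recognizable after alignment."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(">" + prefix + line[1:])
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def combine_reference(genome_fasta, sponge_fasta):
    """Append the tagged sponge contigs to the primary genome FASTA text."""
    return genome_fasta.rstrip("\n") + "\n" + tag_sponge_fasta(sponge_fasta)


def drop_sponge_alignments(sam_lines, prefix=SPONGE_PREFIX):
    """Keep SAM header lines and alignments whose RNAME (column 3) is not a sponge contig."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        rname = line.split("\t")[2]
        if not rname.startswith(prefix):
            kept.append(line)
    return kept


# Toy inputs for illustration only.
genome = ">chr1\nACGT\n"
sponge = ">rDNA\nGGGG\n"
combined = combine_reference(genome, sponge)

sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tsponge_rDNA\t1\t60\t4M\t*\t0\t0\tGGGG\tIIII",
]
filtered = drop_sponge_alignments(sam)
```

Unlike a fixed coordinate list, this approach lets the aligner itself decide which reads belong to artifact-prone sequence, which is what "alignment-guided" refers to above.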

## Full-text entities

- **Genes:** SSX2 (SSX family member 2) [NCBI Gene 6757] {aka CT5.2, CT5.2A, HD21, HOM-MEL-40, SSX}, MTOR (mechanistic target of rapamycin kinase) [NCBI Gene 2475] {aka FRAP, FRAP1, FRAP2, RAFT1, RAPT1, SKS}, FOS (Fos proto-oncogene, AP-1 transcription factor subunit) [NCBI Gene 2353] {aka AP-1, C-FOS, p55}, CTCF (CCCTC-binding factor) [NCBI Gene 10664] {aka CFAP108, FAP108, MRD21}, FOXA1 (forkhead box A1) [NCBI Gene 3169] {aka HNF3A, TCF3A}, SREBF2 (sterol regulatory element binding transcription factor 2) [NCBI Gene 6721] {aka SREBP-2, SREBP2, bHLHd2}, BCL10 (BCL10 immune signaling adaptor) [NCBI Gene 8915] {aka CARMEN, CIPER, CLAP, IMD37, c-E10, mE10}, RAF1 (Raf-1 proto-oncogene, serine/threonine kinase) [NCBI Gene 5894] {aka CMD1NN, CRAF, NS5, Raf-1, c-Raf}, LGR5 (leucine rich repeat containing G protein-coupled receptor 5) [NCBI Gene 8549] {aka FEX, GPR49, GPR67, GRP49, HG38}, KDM5A (lysine demethylase 5A) [NCBI Gene 5927] {aka NEDEHC, RBBP-2, RBBP2, RBP2}, NOTCH2 (notch receptor 2) [NCBI Gene 4853] {aka AGS2, HJCYS, hN2}, BIK (BCL2 interacting killer) [NCBI Gene 638] {aka BIP1, BP4, NBK}, SPDYE1 (speedy/RINGO cell cycle regulator family member E1) [NCBI Gene 285955] {aka Ringo1, SPDYB2L2, SPDYE, WBSCR19}, ECT2 (epithelial cell transforming 2) [NCBI Gene 1894] {aka ARHGEF31}, REST (RE1 silencing transcription factor) [NCBI Gene 5978] {aka DFNA27, GINGF5, HGF5, NRSF, WT6, XBR}, FOSL2 (FOS like 2, AP-1 transcription factor subunit) [NCBI Gene 2355] {aka ACED, FRA2}, PIK3CA (phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha) [NCBI Gene 5290] {aka CCM4, CLAPO, CLOVE, CWS5, HMH, MCAP}, E2F2 (E2F transcription factor 2) [NCBI Gene 1870] {aka E2F-2}, MAF (MAF bZIP transcription factor) [NCBI Gene 4094] {aka AYGRP, CCA4, CTRCT21, c-MAF}, TP53 (tumor protein p53) [NCBI Gene 7157] {aka BCC7, BMFS5, LFS1, P53, TRP53}, FANCE (FA complementation group E) [NCBI Gene 2178] {aka FACE, FAE}, BTG2 (BTG anti-proliferation factor 2) [NCBI Gene 7832] {aka APRO1, PC3, TIS21}, MLH1 (mutL homolog 1) [NCBI Gene 4292] {aka COCA2, FCC2, HNPCC, HNPCC2, LYNCH2, MLH-1}, JUN (Jun proto-oncogene, AP-1 transcription factor subunit) [NCBI Gene 3725] {aka AP-1, AP1, c-Jun, cJUN, p39}
- **Diseases:** Cancer (MESH:D009369), breast cancer (MESH:D001943), synovial sarcoma (MESH:D013584), multiple myeloma (MESH:D009101), oncogenes (MESH:D000074723), tumor suppressor (OMIM:601308)
- **Chemicals:** TAK-981 (-)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606], Drosophila melanogaster (fruit fly, species) [taxon 7227]
- **Mutations:** T2T
- **Cell lines:** SYO-1 — Homo sapiens (Human), Biphasic synovial sarcoma, Cancer cell line (CVCL_7146), GM12878 — Homo sapiens (Human), Transformed cell line (CVCL_7526)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13020910/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13020910/full.md

## References

55 references — full list in the complete paper: https://tomesphere.com/paper/PMC13020910/full.md

---
Source: https://tomesphere.com/paper/PMC13020910