# Genome- and peak-informed two-stage framework for scATAC-seq cell type identification

**Authors:** Yan Liu, Sheng Guan, He Yan, Long-Chen Shen, Yiheng Zhu, Ji-Peng Qiang, Guo Wei

PMC · DOI: 10.1093/bioinformatics/btaf682 · Bioinformatics · 2025-12-27

## TL;DR

This paper introduces seqAlignATAC, a new method for identifying cell types in scATAC-seq data by combining genomic sequence information and domain adaptation to improve accuracy and reduce batch effects.

## Contribution

The paper contributes seqAlignATAC, a two-stage framework for scATAC-seq cell type annotation that pairs sequence embeddings from a pretrained nucleotide language model with an adaptive alignment module for domain adaptation.

## Key findings

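- seqAlignATAC embeds the genomic sequences of accessible peaks with a large-scale pretrained nucleotide language model, aiming to capture long-range dependencies and regulatory signals that shallow sequence representations miss.
- An adaptive alignment module harmonizes feature distributions between the labeled reference and unlabeled target datasets to mitigate batch effects.
- Across multiple experimental settings, the authors report competitive accuracy and robustness in cell type identification.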

## Abstract

Accurate cell type annotation is essential in scATAC-seq analysis, as it underpins the characterization of cellular heterogeneity, the identification of regulatory elements, and downstream biological discovery. However, current annotation methods still face major challenges. First, although some approaches attempt to integrate genomic sequence information, they typically rely on shallow sequence representations and thus fail to capture the long-range dependencies and regulatory signals encoded in DNA. Second, substantial batch effects introduced by different platforms, sequencing batches, or tissue sources remain insufficiently addressed. Existing models often lack robust distribution alignment and domain generalization capabilities, leading to confounding non-biological variation and reduced annotation accuracy across datasets.

To overcome these limitations, we propose seqAlignATAC, a two-stage intra-modality annotation framework that integrates sequence-derived embeddings with domain adaptation. In the first stage, we employ a large-scale pretrained nucleotide language model to extract low-dimensional, biologically informative representations from the genomic sequences of chromatin-accessible peaks. In the second stage, these embeddings are fed into a supervised neural network equipped with an adaptive alignment module to mitigate batch effects and harmonize feature distributions between labeled reference and unlabeled target datasets. Extensive experiments across multiple settings demonstrate that seqAlignATAC achieves competitive accuracy and robustness, effectively leveraging genome-level information while alleviating batch-induced distributional discrepancies.
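
The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how peak embeddings are pooled into cell representations or which alignment objective is used. The sketch assumes mean pooling over open peaks and uses an RBF-kernel maximum mean discrepancy (MMD) as a stand-in measure of the reference/target distribution gap. The names `cell_embeddings` and `rbf_mmd2` are hypothetical, and random vectors replace the language-model output.

```python
import numpy as np

def cell_embeddings(accessibility, peak_embeddings):
    """Mean-pool the embeddings of the peaks that are open in each cell.

    Assumption: mean pooling is an illustrative choice, not taken from the paper.
    """
    acc = (np.asarray(accessibility) > 0).astype(float)  # binarize counts
    n_open = acc.sum(axis=1, keepdims=True)
    n_open[n_open == 0] = 1.0                            # avoid division by zero
    return acc @ peak_embeddings / n_open

def rbf_mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y (RBF kernel).

    Assumption: MMD stands in for the paper's unspecified alignment objective.
    """
    def kernel(a, b):
        sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

rng = np.random.default_rng(0)

# Stage 1 stand-in: one 16-dimensional vector per peak. In practice these
# would come from a pretrained nucleotide language model.
peak_emb = rng.normal(size=(200, 16))

# Synthetic cell-by-peak count matrices for a reference and a target dataset.
ref_counts = rng.poisson(0.3, size=(50, 200))
tgt_counts = rng.poisson(0.3, size=(40, 200))

ref = cell_embeddings(ref_counts, peak_emb)
tgt = cell_embeddings(tgt_counts, peak_emb)

# Stage 2 stand-in: quantify the distribution gap an alignment module
# would try to reduce. A constant offset simulates a batch effect.
shifted = tgt + 0.5
mmd_same = rbf_mmd2(ref, tgt)
mmd_shift = rbf_mmd2(ref, shifted)
```

In a training loop, a term like `rbf_mmd2` would typically be added to the supervised classification loss on the reference cells, so that the network learns features that are both discriminative and similarly distributed across datasets. Here `mmd_shift` exceeds `mmd_same` because the simulated batch offset moves the target cells away from the reference distribution.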

The source code of seqAlignATAC is available at: https://github.com/BioCS-Lab/seqAlignATAC.

## Full-text entities

- **Genes:** NR3C1 (nuclear receptor subfamily 3 group C member 1) [NCBI Gene 2908] {aka GCCR, GCR, GCRST, GR, GRL}

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12930843/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12930843/full.md

## References

27 references — full list in the complete paper: https://tomesphere.com/paper/PMC12930843/full.md

---
Source: https://tomesphere.com/paper/PMC12930843