# A comprehensive benchmark of single-cell Hi-C embedding tools

**Authors:** Dylan Plummer, Xiuyuan Lang, Shanshan Zhang, Yan Li, Jing Li, Fulai Jin

PMC · DOI: 10.1038/s41467-025-64186-4 · 2025-10-14

## TL;DR

This paper benchmarks 13 embedding tools on 10 single-cell Hi-C (scHi-C) datasets and finds that data representation, preprocessing, and the biological setting are often more important than the choice of tool for capturing genome architecture heterogeneity.

## Contribution

A new benchmarking framework for scHi-C embedding tools and insights into the impact of data representation and preprocessing.

## Key findings

- No single tool performs best across all datasets under default settings.
- Embedding cells from early embryonic stages relies on long-range, compartment-scale contacts, whereas resolving cell cycle phases and complex tissues requires short-range, loop-scale contacts.
- Deep-learning methods handle sparsity better and are more versatile across resolutions.
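The distinction between short-range and long-range contacts can be illustrated with a minimal sketch that partitions intra-chromosomal contacts by genomic distance. The function name `split_by_distance`, the tuple layout, and the 2 Mb threshold are assumptions made for illustration only; they are not taken from the paper or from any benchmarked tool.

```python
from typing import List, Tuple

# Hypothetical contact record: (chromosome, position 1, position 2), both ends
# on the same chromosome, positions in base pairs.
Contact = Tuple[str, int, int]


def split_by_distance(
    contacts: List[Contact], threshold_bp: int = 2_000_000
) -> Tuple[List[Contact], List[Contact]]:
    """Partition contacts into short-range and long-range sets.

    The 2 Mb default is an illustrative assumption, not a value from the paper.
    """
    short_range: List[Contact] = []
    long_range: List[Contact] = []
    for chrom, pos1, pos2 in contacts:
        # Genomic distance between the two contact ends.
        if abs(pos2 - pos1) < threshold_bp:
            short_range.append((chrom, pos1, pos2))
        else:
            long_range.append((chrom, pos1, pos2))
    return short_range, long_range


example = [
    ("chr1", 1_000_000, 1_200_000),   # 200 kb apart
    ("chr1", 5_000_000, 45_000_000),  # 40 Mb apart
    ("chr2", 3_000_000, 5_000_000),   # exactly 2 Mb apart
]
short_contacts, long_contacts = split_by_distance(example)
```

An embedding restricted to `short_contacts` would emphasize loop-scale structure, while one restricted to `long_contacts` would emphasize compartment-scale structure; where to place the cutoff is a modeling choice.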

## Abstract

Embedding is the key step in single-cell Hi-C (scHi-C) analysis, which relies on capturing biologically meaningful heterogeneity at various levels of genome architecture. To understand the strengths and limitations of existing tools in various applications, here we use ten scHi-C datasets to benchmark thirteen embedding tools, including Va3DE, a new convolutional neural network model that can accommodate large cell numbers. We built a software framework to decouple the preprocessing options of existing tools and found that no single tool works best across all datasets under default settings. The difficulty levels and preferred resolutions differ between benchmark datasets, and the choice of data representation and preprocessing strongly impacts the embedding performance. Embedding cells from early embryonic stages relies on long-range compartment-scale contacts, but resolving cell cycle phases and complex tissues requires short-range loop-scale contacts. Both random-walk and inverse document frequency (IDF) transformations prefer long-range “compartment-scale” over short-range “loop-scale” embedding, while deep-learning methods better overcome sparsity at both scales and are more versatile with different resolutions. Finally, “diagonal integration” with independent data modalities is a promising approach to distinguish similar cell subpopulations. Our findings underscore the significance of appropriate priors for scHi-C embedding and also offer insights into genome architecture heterogeneity.
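As a rough illustration of the IDF transformation mentioned above, the following sketch reweights a binarized cell-by-contact matrix so that contacts present in many cells are down-weighted. It assumes the plain `log(N / df)` variant of IDF and a hypothetical function name `idf_transform`; the benchmarked tools may use different weighting, smoothing, or normalization.

```python
import math
from typing import List


def idf_transform(binary_matrix: List[List[int]]) -> List[List[float]]:
    """Apply a plain IDF weighting to a binarized cell-by-feature matrix.

    Rows are cells and columns are contact features (for example, bin pairs).
    This uses idf_j = log(n_cells / df_j), one of several common variants;
    it is an illustrative assumption, not the formula of any specific tool.
    """
    n_cells = len(binary_matrix)
    if n_cells == 0:
        return []
    n_features = len(binary_matrix[0])
    # Document frequency: number of cells in which each feature is present.
    doc_freq = [sum(row[j] for row in binary_matrix) for j in range(n_features)]
    # Features absent from every cell get weight 0 to avoid division by zero.
    idf = [math.log(n_cells / df) if df > 0 else 0.0 for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_features)] for row in binary_matrix]


# Two cells, three contact features: the first is shared by both cells,
# the second is specific to cell 0, the third is never observed.
cells = [
    [1, 1, 0],
    [1, 0, 0],
]
weighted = idf_transform(cells)
```

Under this variant, a contact observed in every cell receives weight zero, so ubiquitous contacts contribute nothing to the embedding, while rarer contacts receive larger weights.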

Embedding is a key step in single-cell Hi-C analysis to identify cell states. Here, the authors benchmark 13 embedding methods in 10 scHi-C datasets. They find that data representation, preprocessing options, and biological settings are often more important considerations than the actual methods.

## Full-text entities

- **Genes:** PCSK1 (proprotein convertase subtilisin/kexin type 1) [NCBI Gene 5122] {aka BMIQ12, NEC1, PC1, PC1/3, PC3, SPC3}, Car3 (carbonic anhydrase 3) [NCBI Gene 12350] {aka Ca3, Car-3}
- **Tools and metrics:** Higashi, Fast-Higashi, Va3DE, ARI, IDF (inverse document frequency)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** mESC — Mus musculus (Mouse), Embryonic stem cell (CVCL_4378), 64C — Mus musculus (Mouse), Hybridoma (CVCL_B7CZ)

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12521359/full.md

---
Source: https://tomesphere.com/paper/PMC12521359