# Large-scale paired chain BCR analysis reveals antibody clonal family inference bias and enhances resolution with machine learning

**Authors:** Hao Wang, Kaixuan Wang, Qihang Xu, Linru Cai, Chuanxiang Huang, Linlin Chen, Yunliang Zang, Xihao Hu, Jian Zhang

PMC · DOI: 10.1371/journal.pcbi.1014077 · 2026-03-11

## TL;DR

This study shows that inferring antibody clonal families from heavy chains alone can produce chain-mixed and naive-like pseudo-clonal clusters, and introduces fastBCR-p, a method that improves inference by incorporating light-chain data.

## Contribution

The paper introduces fastBCR-p, a new method that improves clonal family inference by integrating light-chain data and correcting technical and biological artifacts.

## Key findings

- Heavy-chain-only clustering can misrepresent true clonal architecture by creating chain-mixed and pseudo-clonal clusters.
- fastBCR-p improves clonal inference by resolving technical artifacts and biological convergence in real-world datasets.
- The new method enhances the accuracy of tracking immune dynamics and identifying clinically relevant antibody lineages.
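The notion of a chain-mixed cluster can be made concrete with a small score. The following is a minimal sketch, not the paper's metric: it assumes each cell in a heavy-chain cluster carries hypothetical `lv` and `lj` fields (light-chain V and J gene calls) and reports the fraction of cells that share the most common light-chain gene pair.

```python
from collections import Counter

def light_chain_concordance(cluster):
    """Fraction of cells sharing the cluster's most common light-chain (V, J) pair.

    `cluster` is a non-empty list of dicts with hypothetical keys "lv" and "lj".
    """
    pairs = Counter((cell["lv"], cell["lj"]) for cell in cluster)
    dominant_count = pairs.most_common(1)[0][1]
    return dominant_count / len(cluster)

# Toy heavy-chain cluster: three cells pair with a kappa chain, one with a lambda chain.
cluster = [
    {"lv": "IGKV1-39", "lj": "IGKJ1"},
    {"lv": "IGKV1-39", "lj": "IGKJ1"},
    {"lv": "IGKV1-39", "lj": "IGKJ1"},
    {"lv": "IGLV2-14", "lj": "IGLJ2"},
]

score = light_chain_concordance(cluster)  # 0.75
is_chain_mixed = score < 1.0              # True: more than one light-chain pair present
```

A score of 1.0 means every cell in the heavy-chain cluster uses the same light-chain genes; anything lower flags a candidate chain-mixed cluster under this simplified definition.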

## Abstract

A fundamental question in immunology is how the adaptive immune system encodes antigen specificity while maintaining repertoire diversity. B cell receptor (BCR) or antibody clonal families, defined as groups of B cells descending from a common ancestor, are key to deciphering this encoding. Although paired heavy and light chains jointly determine antibody specificity, most repertoire analyses have historically relied on heavy-chain-only data due to the loss of native pairing information in bulk BCR sequencing. This reliance introduces potential biases in computational clonal cluster inference, which may complicate efforts to resolve disease-associated immune signatures. Here, we leverage large-scale paired-chain BCR sequencing data to demonstrate that heavy-chain-based clustering may misrepresent true clonal architecture, and identify two major artifacts: chain-mixed clusters, in which similar heavy chains are paired with distinct light chains, and naive-like pseudo-clonal clusters, which are detected in an individual’s naive B cell repertoire and exhibit highly similar heavy and light chains without reflecting true clonal expansion. To address these limitations, we present fastBCR-p, an optimized framework that integrates light-chain-informed subclustering with public-sequence-aware refinement to improve clonal family inference. By resolving both technical artifacts and biological convergence, fastBCR-p improves the chain concordance and overall clustering quality of clonal inference in real-world datasets. This enables more accurate tracking of immune dynamics in health and disease and facilitates the identification of clinically relevant antibody lineages.
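To illustrate the general idea of light-chain-informed subclustering, the sketch below splits each heavy-chain cluster into subclusters of cells that share the same light-chain V and J genes. This is an assumption-laden simplification and not fastBCR-p's actual procedure, which this summary does not specify: the function name `split_by_light_chain`, the field names `id`, `lv`, and `lj`, and the exact-gene-match rule are all hypothetical.

```python
from collections import defaultdict

def split_by_light_chain(heavy_clusters):
    """Split heavy-chain clusters by light-chain (V, J) gene usage.

    `heavy_clusters` maps a heavy-chain cluster ID to a list of cell dicts
    with hypothetical keys "id", "lv", and "lj". Returns a dict mapping
    (heavy_cluster_id, lv, lj) to the list of cell IDs in that subcluster.
    """
    subclusters = defaultdict(list)
    for heavy_id, cells in heavy_clusters.items():
        for cell in cells:
            subclusters[(heavy_id, cell["lv"], cell["lj"])].append(cell["id"])
    return dict(subclusters)

# Toy input: cluster H1 is chain-mixed, cluster H2 is not.
heavy_clusters = {
    "H1": [
        {"id": "c1", "lv": "IGKV1-39", "lj": "IGKJ1"},
        {"id": "c2", "lv": "IGKV1-39", "lj": "IGKJ1"},
        {"id": "c3", "lv": "IGLV2-14", "lj": "IGLJ2"},
    ],
    "H2": [
        {"id": "c4", "lv": "IGKV3-20", "lj": "IGKJ2"},
    ],
}

subclusters = split_by_light_chain(heavy_clusters)
# H1 is separated into a kappa subcluster (c1, c2) and a lambda subcluster (c3).
```

Exact gene matching is the simplest possible rule; a realistic pipeline would likely also compare light-chain CDR3 sequences and handle ambiguous gene calls, which this sketch deliberately leaves out.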

Our immune system protects us by producing a vast and diverse collection of antibodies, each designed to recognize a specific target. These antibodies are made by B cells, which expand and evolve in groups known as clonal families. Accurately identifying these clonal families from sequencing data is essential for understanding immune responses during infection, vaccination, and disease. Most existing computational methods infer B-cell clonal families using information from only one part of the antibody, the heavy chain. This limitation largely reflects the fact that traditional sequencing technologies often lose information about how heavy and light chains are naturally paired. However, both chains are required to define antibody specificity. Using large-scale datasets that preserve native heavy–light chain pairing, we show that heavy-chain-only approaches can introduce systematic errors. These include incorrectly grouping together unrelated B cells and falsely identifying naive B cells as expanded clones. To address these limitations, we developed fastBCR-p, which integrates light-chain information and accounts for shared (“public”) antibody sequences. By correcting both technical artifacts and biological convergence, fastBCR-p enables more accurate clonal family inference, improves the analysis of immune repertoire dynamics, and facilitates the identification of clinically relevant antibody lineages.

## Full-text entities

- **Genes:** BCR (BCR activator of RhoGEF and GTPase) [NCBI Gene 613] {aka ALL, BCR1, CML, D22S11, D22S662, PHL}, IGKV@ (immunoglobulin kappa variable cluster) [NCBI Gene 3519] {aka IGKV, IGKV1, IGKV1@, IGKV2, IGKV2@, IGKV3}, IGH (immunoglobulin heavy locus) [NCBI Gene 3492] {aka IGD1, IGH.1@, IGH@, IGHD@, IGHDY1, IGHJ}, IGKJ (immunoglobulin kappa joining cluster) [NCBI Gene 7842] {aka IGKJ@}, SMOC1 (SPARC related modular calcium binding 1) [NCBI Gene 64093] {aka OAS}, IGLV@ (immunoglobulin lambda variable cluster) [NCBI Gene 3546] {aka IGLV}, IGLJ (immunoglobulin lambda joining cluster) [NCBI Gene 8217] {aka IGLJ@}, IGHD (immunoglobulin heavy constant delta) [NCBI Gene 3495]
- **Diseases:** SARS-CoV-2 infection (MESH:D000086382), Multiple Sclerosis (MESH:D009103), SHM (MESH:D013001), CMV (MESH:D003586), infection (MESH:D007239), LC (MESH:D000075363), HC (MESH:D006362)
- **Chemicals:** amino acid (MESH:D000596)
- **Species:** Homo sapiens (human, species) [taxon 9606], Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049]

## Figures

50 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12998946/full.md

---
Source: https://tomesphere.com/paper/PMC12998946