Allele-specific immune gene quantification and expression analysis in single-cell RNA-seq data

Ahmad Al Ajami; Jonas Schuck; Federico Marini; Katharina Imkeller

PMC · DOI:10.1093/nargab/lqaf149·November 11, 2025

Allele-specific immune gene quantification and expression analysis in single-cell RNA-seq data

Ahmad Al Ajami, Jonas Schuck, Federico Marini, Katharina Imkeller

PDF

Open Access

TL;DR

This paper introduces a new computational method to analyze immune gene expression at the allele level in single-cell RNA-seq data, offering insights into genetic diversity in immune molecules.

Contribution

A novel computational methodology for allele-specific expression quantification of immune genes in scRNA-seq data.

Findings

01

The method enables interactive exploration of immune gene expression across multiple immunogenetic annotation layers.

02

Validation on multiple scRNA-seq datasets shows performance across different experimental setups.

03

The approach reveals loss of HLA expression in tumor cells and differential HLA allele expression in immune cell subtypes.

Abstract

Immune molecules such as human leukocyte antigens (HLAs) or killer Ig-like receptors are encoded in the most genetically diverse loci of the human genome. Many of these immune genes exhibit remarkable allelic diversity across populations. Here, we present a novel computational methodology for allele-specific expression quantification of immune genes in single-cell RNA sequencing (scRNA-seq) data. Our quantification method features a novel R/Bioconductor data structure that can handle expression data across multiple immunogenetic annotation layers and enables interactive exploration of immune gene expression. We validate our methodology on multiple scRNA-seq datasets to demonstrate its performance across different experimental setups. We illustrate how these new tools allow us to study loss of HLA expression in tumor cells and to discover differential HLA allele expression in specific…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

HLA-S

Species1

Homo sapiens(human · species)

Diseases1

tumor

Figures4

Click any figure to enlarge with its caption.

Overview of the methodology (A) Overview of the integration between the two key tools in our methodology: the scIGD Snakemake workflow and the SCAE R/Bioconductor package. (B) Detailed schematic representation of scIGD’s processing steps, illustrating how raw sequencing data is processed to generate HLA allele-typing and quantification results. Nonimmune genes are quantified at the conventional gene level, whereas immune gene quantification is performed at the allele level. The output serves as input for downstream analysis in SCAE. (C) Detailed schematic representation of the multilayered data structure within SCAE, showing how HLA data is organized across allele, gene, and functional class layers. With the help of a lookup table, allele counts outputted by scIGD and belonging to the same gene are aggregated to populate the gene layer (1). Resulting gene counts belonging to the same functional class (e.g. HLA class I and class II) are further aggregated to populate the functional class layer (2).

File and directory structure in the scIGD workflow. (A) File and directory structure provided upon installation of the scIGD workflow. (B) Exemplary directory structure for setting up scIGD to analyze a WTA-based dataset. The input files required are highlighted in blue. While this layout serves as a guideline, the specific file paths can be customized within the config.yaml file. (C) Overview of the output files generated from running scIGD on a WTA-based dataset. The output files generated are highlighted in orange. This includes log files for each processing step, as well as the outputs generated by the tools involved in both HLA allele-typing and quantification stages. Directories with “…” indicate that additional files are present but have been collapsed for simplicity. The four key files required for downstream analysis with the SCAE package are highlighted in yellow and enclosed within a dashed box.

Validation of HLA allele-typing, specificity, and quantification in WTA-based data (A) Absolute frequency of HLA-A alleles typed in a subset of the AIDA dataset [19], including 14 samples: 8 KR (left panel) and 6 JP (right panel) individuals. HLA allele-typing was performed using the scIGD workflow, with no prior information about ethnicity or population. (B) Relative frequency of the most abundant HLA-A*24:02 allele group identified in panel (A) in different human reference populations from the Allele Frequency Net Database (AFND) [27]. The bar graph on the left shows the seven highest relative frequencies of the HLA-A*24:02 allele group in AFND. The map on the right shows a visualization of the allele group frequencies in human populations worldwide. (C) Comparison of HLA-A and HLA-DRA gene expression quantification using scIGD (x-axis) versus kallisto-bustools without using its -mm parameter (y-axis). Raw gene expression counts in the 20k PBMC dataset are shown. HLA gene expression in scIGD is calculated by aggregating the counts of alleles corresponding to each gene. Pearson correlation coefficient is indicated. (D) Assessment of HLA allele specificity in the Merkel cell carcinoma PBMC pre-treatment dataset. The plot displays normalized expressions of different HLA alleles, with alleles typed by arcasHLA [9] in this particular dataset highlighted in black, and those from an unrelated dataset in blue. (E) A heatmap illustrating the sequence similarity percentages of HLA allele sequences from panel (D), calculated with EMBOSS Needle [29].

Application of scIGD for biological discovery. (A, B) Results from the Merkel-cell carcinoma dataset, which includes PBMC (“pbmc_*”) and tumour (“tumor_*”) samples, pre- and post-treatment (“*_preRx” and “*_postRx”), to explore HLA loss in a solid tumor. (A) Visualization of the log-transformed normalized expression of HLA class I functional gene group using a t-SNE plot. t-SNE was generated from a batch-corrected dimensionality reduction using the MNN integration method and is faceted by sample. Notably, HLA class I expression is high in PBMC samples and significantly lower in tumor samples. (B) Bubble plot representation of HLA class I allele expression across the four samples from the same Merkel-cell carcinoma dataset as in panel (A). The color scale reflects the average of the log-transformed normalized expression values, highlighting differences in allele expression among the samples, while the size of the dots encodes the proportion of cells expressing HLA class I. (C–F) Results from the 20k PBMC dataset to assess HLA allele-specific expression in different cell types and subsets. (C) Cluster annotation on t-SNE. Louvain clustering was performed using the SNN graph constructed from the PCA results. Cluster annotation was achieved using candidate marker genes identified through differential expression testing between pairs of clusters. (D) Log-transformed normalized expression values of HLA-DQB1 alleles in the three monocyte subclusters from the 20k PBMC dataset. (E) Boxplots displaying the distribution of the HLA-DQB1*02:02:01:01/HLA-DQB1*06:03:01:01 expression ratio in single cells of the three monocyte subclusters. (F) Heatmap showing the centered mean expression values of the most important marker genes for the three monocyte subclusters.

Tables1

Table 1.. Summary of scRNA-seq datasets analyzed using the scIGD workflow

Dataset	Library	Experimental setup	Multiplexed	Number of donors	Longitudinal study	Reference
AIDA	10 × 5` v2	WTA	No	14	No	Kock et al. 2025
Merkel-cell carcinoma	10 × 3` v2	WTA	No	1	Yes	Paulson et al. 2018
20k PMBC	10 × 3` v3	WTA	No	1	No	10X Genomics
Multiple myeloma	BD Rhapsody	Amplicon	Yes	8	No	Enssle et al. 2024

Funding7

—Chan Zuckerberg Initiative DAF
—Silicon Valley Community Foundation10.13039/100000923
—Mildred Scheel Early Career Center
—LOEWE Center Frankfurt Cancer Institute
—Hessen State Ministry for Higher Education, Research and the Arts
—Deutsche Forschungsgemeinschaft10.13039/501100001659
—Goethe University Frankfurt

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSingle-cell and spatial transcriptomics · T-cell and B-cell Immunology · vaccines and immunoinformatics approaches

Full text

Introduction

Immune genes play a critical role in mediating antigen recognition and exhibit a high degree of structural diversity, which is essential for adapting to the vast array of pathogens encountered by the immune system. This diversity is achieved through two genetic mechanisms: polygeny and hyperpolymorphism [1]. Polygeny refers to the presence of multiple functionally similar genes that encode similar protein subunits, as for example IGLC2 and IGLC7 genes which both encode an immunoglobulin light chain constant domain. On the other hand, hyperpolymorphism denotes the presence of numerous allelic variants of the same gene across individuals. As a key example of this immunogenomic diversity, we focus on the major histocompatibility complex (MHC) in this study. Also known as human leukocyte antigens (HLAs) in humans, the MHC molecules present antigens to T cells, thereby modulating adaptive immune responses. The HLA region is located on chromosome 6 and includes >200 genes [2, 3]. This includes three principal class I genes (HLA-A, HLA-B, and HLA-C) and three pairs of class II genes (HLA-DR, HLA-DP, and HLA-DQ). These genes exhibit remarkable diversity, with around 28 000 HLA class I alleles and ∼12 400 HLA class II alleles identified to date [2, 3].

Most individuals are heterozygous for HLA genes, inheriting different alleles from each parent. Codominant expression of HLA alleles, where both alleles are equally expressed, enables individuals to present a broader repertoire of antigens [1]. HLA class I molecules, expressed on nearly all nucleated cells, present antigens to CD8 + T cells, while HLA class II molecules, predominantly found on specialized immune cells like dendritic cells, macrophages, and B cells, present antigens to CD4 + T cells.

With the growing relevance of immunotherapy in cancer treatment, precise HLA typing and quantification of expression levels have become critical because they enable a better understanding of how HLAs present tumor neoantigens on the cell surface [4]. Several tools have therefore been developed for HLA typing from sequencing data, including seq2HLA [5], OptiType [6], PHLAT [7], HLAProfiler [8], and arcasHLA [9], with benchmarking studies indicating arcasHLA’s consistent performance [10, 11]. Additionally, tools like scHLAcount construct personalized references for HLA allele-specific quantification [12]. However, there remains a significant gap in tools designed to streamline HLA allele-typing and quantification specifically for single-cell RNA sequencing (scRNA-seq) data. Existing methods lack an integrated, user-friendly pipeline that supports both HLA allele-typing and allele-specific expression quantification, and enables interactive exploration of immune gene expression.

In this work, we present a novel comprehensive workflow to bridge the gap in analyzing HLA expression from scRNA-seq data. Our approach not only automates HLA allele-typing and enables allele-specific quantification from scRNA-seq data using a Snakemake-based [13] workflow, scIGD (single-cell ImmunoGenomic Diversity), but also offers an innovative multilayer data structure implemented in the R/Bioconductor package, SingleCellAlleleExperiment (SCAE), which allows for the representation of HLA at multiple levels, ranging from alleles to genes and functional classes. This comprehensive functionality is covered by none of the existing HLA typing or quantification tools [5–12]. SCAE builds directly on the SingleCellExperiment (SCE) [14] class and thereby ensures compatibility with existing Bioconductor tools for single-cell analysis. Example datasets to assist users in testing the workflow are available in our corresponding R/ExperimentHub [15] data package, scaeData.

By capturing HLA allele-specific variations and providing a detailed view of HLA expression, our novel methodology is able to detect HLA allele-specific expression differences across immune cell subsets, as well as HLA loss in tumor samples. Notably, it enables a more accurate and deeper exploration of immune functions and responses, facilitating data-driven immunological research.

Materials and methods

Software description

The Snakemake workflow of scIGD is compatible with Unix-based systems, while SCAE and scaeData are platform-independent, supporting Windows, macOS, and Linux. Comprehensive documentation and usage examples are available under the respective links:

scIGD: https://github.com/AGImkeller/scIGD;

SCAE: https://bioconductor.org/packages/SingleCellAlleleExperiment;

scaeData: https://bioconductor.org/packages/scaeData.

All data and code is also archived under https://doi.org/10.5281/zenodo.16919082.

scIGD

scIGD (single-cell ImmunoGenomic Diversity) is a Snakemake-based workflow designed to automate the typing of HLA alleles and enable allele-specific quantification from scRNA-seq data (Fig. 1). The primary input for scIGD consists of raw, gzipped FASTQ files, which are processed through the workflow’s various stages to generate allele-typing and quantification results.

Overview of the methodology (A) Overview of the integration between the two key tools in our methodology: the scIGD Snakemake workflow and the SCAE R/Bioconductor package. (B) Detailed schematic representation of scIGD’s processing steps, illustrating how raw sequencing data is processed to generate HLA allele-typing and quantification results. Nonimmune genes are quantified at the conventional gene level, whereas immune gene quantification is performed at the allele level. The output serves as input for downstream analysis in SCAE. (C) Detailed schematic representation of the multilayered data structure within SCAE, showing how HLA data is organized across allele, gene, and functional class layers. With the help of a lookup table, allele counts outputted by scIGD and belonging to the same gene are aggregated to populate the gene layer (1). Resulting gene counts belonging to the same functional class (e.g. HLA class I and class II) are further aggregated to populate the functional class layer (2).

The workflow is organized into three distinct stages (Demultiplexing, HLA allele-typing, and Quantification), each addressing specific objectives.

Demultiplexing

The first stage of the scIGD workflow focuses on demultiplexing scRNA-seq datasets which contain reads from multiple donors that were barcoded using sample tags (for example, using the BD Single-Cell Multiplexing Kit). This step is omitted if the input data does not contain sample tag information or consists of only a single donor. The primary goal of this step is to generate donor-specific FASTQ files, which are essential for subsequent allele typing.

To perform demultiplexing, the user must provide two key inputs: the raw, gzipped scRNA-seq FASTQ files (R1 and R2) and a sample tag sequences FASTA file, which contains the unique sequence tags corresponding to each sample. Using kallisto-bustools [16], the input data is processed to produce a count matrix that links cell barcodes with sample tags. For each cell barcode, the method selects the sample tag ID that accounts for at least 75% of the read count. If no single sample tag meets this threshold, the cell is classified as a multiplet and excluded from further analysis.

After the correct sample tag is assigned to each cell barcode, the workflow matches read IDs with cell barcodes and sample tags and splits the original FASTQ files into multiple donor-specific FASTQ files, one for each sample tag. These donor-specific files are then used as inputs for the next stages of the workflow. Demultiplexing in scIGD is currently implemented for scRNA-seq datasets barcoded using the BD Single-Cell Multiplexing Kit.

HLA allele-typing

The second stage of the scIGD workflow focuses on HLA allele-typing. Our workflow supports both amplicon-based and whole transcriptome-based (WTA) scRNA-seq data.

For amplicon-based scRNA-seq, scIGD implements a minimal allele-typing which can only detect allelic differences located in the part of the amplicon that is covered by the respective sequencing approach. A FASTA file containing the complementary DNA (cDNA) sequences of target amplicons is required as input. For a defined set of amplicons of interest, the workflow infers the one or two most abundant transcript variants that can be distinguished given the implemented sequencing coverage. Since these transcript variants typically do not allow identification of reference alleles, the transcript variants are assigned arbitrary allele identifiers (e.g. HLA-A-AV-1, HLA-A-AV-2, HLA-A-AV-3, where AV refers to Allelic Variant).

For WTA scRNA-seq data, a more comprehensive HLA allele-typing is performed using the existing arcasHLA methodology [9]. This stage utilizes a reference genome FASTA file and matching GTF file with genomic annotations. Both files should be downloaded from the Ensembl database [17]. The STAR aligner [18] is used to map measured sequencing reads to the reference genome. arcasHLA then extracts those reads that map to chromosome 6, which contains the HLA loci, and uses the IMGT/HLA database [2, 3] and an expectation–maximization model to determine the combination of HLA alleles that best explains the measured sequencing data.

The allele-typing stage of the scIGD workflow generates a donor-specific reference cDNA FASTA file in which the immune gene references (i.e. the HLA reference genes) are replaced by the cDNA sequences of the previously typed alleles.

Quantification

The final stage of the scIGD workflow focuses on the quantification of genes and typed alleles. This stage applies equally to amplicon-based and WTA scRNA-seq data.

kallisto-bustools [16] is used to perform alignment-free quantification of the transcripts represented in the donor-specific cDNA FASTA reference. Importantly, since the donor-specific cDNA reference may contain two alleles of the same gene, and thus very similar sequences, scIGD uses the -mm option of kallisto-bustools that includes reads that pseudo-align to multiple transcripts in the reference.

Finally, a lookup table is created to map each allele to its corresponding gene and functional class. This table is crucial for constructing the SCAE data structure used in all subsequent steps, because it defines the relationships between alleles, genes, and their functional classes.

Setting up the scIGD analysis workflow

To use scIGD, users must first set up a working directory and create an environment with the required software. We recommend using conda or mamba for managing the environment, and a detailed step-by-step guide for this process is available in the GitHub repository (https://github.com/AGImkeller/scIGD).

As a Snakemake workflow, scIGD includes a configuration file that allows users to adjust the workflow according to their specific experimental setup and data. This config file supports the customization of various parameters of the workflow. Detailed examples and descriptions of these parameters are also provided in the repository to help users with the setup.

The entire scIGD workflow can be executed with a single command. Upon installing scIGD, several files integral to the workflow are automatically set up (Fig. 2A), while additional input files must be supplied by the user. An example of the required input data for a WTA scRNA-seq dataset is shown in Fig. 2B. All output files generated by the workflow are stored in a user-defined output directory, along with detailed log files for each step. Figure 2C provides an overview of the output files produced when scIGD is run on a WTA dataset.

File and directory structure in the scIGD workflow. (A) File and directory structure provided upon installation of the scIGD workflow. (B) Exemplary directory structure for setting up scIGD to analyze a WTA-based dataset. The input files required are highlighted in blue. While this layout serves as a guideline, the specific file paths can be customized within the config.yaml file. (C) Overview of the output files generated from running scIGD on a WTA-based dataset. The output files generated are highlighted in orange. This includes log files for each processing step, as well as the outputs generated by the tools involved in both HLA allele-typing and quantification stages. Directories with “…” indicate that additional files are present but have been collapsed for simplicity. The four key files required for downstream analysis with the SCAE package are highlighted in yellow and enclosed within a dashed box.

As an illustrative example, the full scIGD workflow processed a 50 M-reads WTA dataset in ∼3 h using 32 cores and 40 GB RAM. An amplicon-based dataset of 280 M reads can be analysed in ∼80 min using 8 cores and 8 GB RAM.

SingleCellAlleleExperiment

The second component in our workflow is SCAE, an R/Bioconductor package that extends SCE [14] to support immune gene quantification data generated by scIGD. By building directly on SCE, SCAE inherits its robust data structure and compatibility with a vast array of existing Bioconductor packages for single-cell analysis. This ensures users can seamlessly integrate SCAE objects into established workflows without the need to rewrite or duplicate functionality, leveraging the extensive ecosystem of tools already designed for single-cell data.

SCAE offers a multilayer framework within a single data object, allowing users to analyze immune gene expression across different annotation levels. SCAE imports the count matrix and lookup table generated by scIGD. Following the relationships defined in the lookup table, it aggregates allele-level counts belonging to the same gene into gene-level counts. Resulting gene counts belonging to the same functional class (e.g. HLA class I and class II) are further aggregated into functional-level counts (Fig. 1C). Such a multilayer design provides flexibility, allowing users to focus on specific data layers and adapt analyses as needed, and at the same time makes this object compatible with the wealth of existing packages for single-cell data analysis in the Bioconductor ecosystem. SCAE also supports quality control (QC) and preprocessing steps like cell filtering by knee plot, and normalization that accounts for the extended feature set. The scaeData package provides curated example datasets to serve as a practical resource for users to explore the features and capabilities of the SCAE package.

Description of the presented datasets

With the exception of the demultiplexing step, the scIGD workflow is compatible with both 10x Genomics (different chemistry types are supported) and BD Rhapsody platforms, supporting WTA and amplicon-based scRNA-seq data.

We validated the performance on four different datasets: 14 donors from the AIDA cohort [8 Koreans (KR), 6 Japaneses (JP), chosen from the complete cohort based on highest sequencing depth] [19], the Merkel-cell carcinoma dataset (peripheral blood mononuclear cell [PBMC] and tumor samples, pre- and post-treatment) [20], the 20k PBMC dataset [21], and a multiplexed Multiple myeloma dataset with 8 donors [22] (Table 1).

Preprocessing of datasets after the scIGD quantification

We applied the functionality of SCAE and performed downstream analysis on the Merkel-cell carcinoma and 20k PBMC datasets. During the SCAE object construction, we filtered out empty droplets by setting filter_mode=“yes”, which removes barcodes below the inflection point of the knee plot. Additionally, we set log = TRUE to normalize and log-transform the raw read counts. Because SCAE extends the count matrix to multiple partially overlapping feature layers, it’s critical that normalization is performed using this built-in function which operates only on the original input data—excluding the aggregated feature layers. This ensures that HLA features are not counted multiple times and thus do not disproportionately affect the normalization step. We then performed comprehensive preprocessing, including the exploration of QC metrics and subsequent filtering of cells and features for robust downstream analysis.

For the Merkel-cell carcinoma dataset, all four samples were merged into a single SCAE object. After batch correction using the mutual nearest neighbors (MNN) method [23], we performed principal component analysis (PCA) on highly variable genes to reduce the dimensionality of the data while identifying and retaining main sources of variation among the cells. The results were projected onto a t-distributed stochastic neighbor embedding (*t-*SNE). To ensure accurate comparisons across samples, we normalized and logarithmically transformed the counts using multiBatchNorm, accounting for the sample information [23].

For the 20k PBMC dataset, raw counts were normalized and logarithmically transformed using logNormCounts [24]. PCA was performed on the highly variable genes prior to constructing the shared nearest neighbor (SNN) graph and performing Louvain clustering with igraph [25]. The Louvain clustering result with *k *= 50 returned a reasonable number of clusters and was used in downstream analysis. Importantly, since SCAE builds directly on the SCE class, we were able to use existing SCE-based functions and methods for all analysis steps mentioned above.

Recommendations for QC of quantification results

When using scIGD workflow, we recommend applying state-of-the-art QC steps that are established for standard scRNA-seq workflows. For example, empty-droplet filtering should be performed using the ranking of barcodes based on unique molecular identifier (UMI) counts (knee plot) to ensure the removal of low-UMI barcodes. Low-quality sequences and dying cells should be removed by filtering on the fraction of mitochondrial reads per cell. In addition, there are a few QC aspects to consider regarding the allele-specific quantification. Since the reliability of the HLA typing and quantification relies on the overall HLA expression and sequencing depth in the dataset, it is recommended to restrict the analysis to datasets where the median total count of HLA-specific reads per cell is sufficiently high. As an internal control for data quality, we recommend to confirm that expected biological patterns are preserved in the data, such as for example higher expression of HLA class II genes in professional antigen-presenting cells.

Results

Our newly developed workflow comprises two software tools: scIGD, implemented using Snakemake, and SCAE, an R/Bioconductor package (Fig. 1A). scIGD is designed as a Snakemake workflow to perform HLA allele-typing and allele-specific quantification from scRNA-seq data (Fig. 1B). On the other hand, SCAE is designed to run entirely within an R environment, allowing for multilayer representations of immune gene quantification data, ranging from allele- to gene- and functional-level annotations (Fig. 1C).

scIGD accurately types and quantifies HLA alleles in a broad range of single-cell sequencing datasets

scIGD performs HLA-allele typing to determine the HLA alleles expressed by an individual using arcasHLA [9]. This tool was chosen based on its high accuracy in benchmarking studies of HLA typing on single-cell data [10, 11]. The HLA typing result is then used as a donor-specific reference for expression quantification using kallisto-bustools [16]. We tested the allele typing as well as the subsequent quantification separately to ensure each step’s reliability. Accurate allele typing is crucial as it defines the reference for quantification, while accurate quantification is essential for reliable downstream analyses of expression patterns.

To validate both steps, we processed four distinct datasets representing various conditions using scIGD (Table 1). These included the AIDA cohort to assess performance across different ethnic backgrounds, the 20k PBMC dataset for whole-transcriptome data, the Multiple myeloma dataset for amplicon-based data, and the Merkel-cell carcinoma dataset to test compatibility with longitudinal samples. This allowed us to assess scIGD across different sequencing platforms and experimental setups.

HLA allele typing using arcasHLA

To confirm the correct integration of arcasHLA into our workflow and showcase the HLA allele-typing performance of scIGD, we first applied our method to a subset of the AIDA cohort [19] that included 14 scRNA-seq datasets (8 from KR and 6 from JP individuals). The allele-typing results revealed common HLA-A and HLA-C alleles shared across both populations (Fig. 3A and Supplementary Fig. S1A), consistent with their geographical origin and shared ancestry. For instance, three out of the eight KR and four out of the six JP individuals carried the A24:02:01:01* allele, which was also overall most abundant in the allele-typing results. To further explore the population-wide distribution of this allele, we used the immunotation package [26] to determine its frequency across various human reference populations from the Allele Frequency Net Database [27] (Fig. 3B). As anticipated, the A24:02* allele group was found to be more frequent in East Asian populations or those of East Asian descent compared to other populations. Similarly, we conducted this analysis for the C01:02* allele group, and observed comparable results, with a higher frequency in East Asian populations (Supplementary Fig. S1B).

Validation of HLA allele-typing, specificity, and quantification in WTA-based data (A) Absolute frequency of HLA-A alleles typed in a subset of the AIDA dataset [19], including 14 samples: 8 KR (left panel) and 6 JP (right panel) individuals. HLA allele-typing was performed using the scIGD workflow, with no prior information about ethnicity or population. (B) Relative frequency of the most abundant HLA-A24:02 allele group identified in panel (A) in different human reference populations from the Allele Frequency Net Database (AFND) [27]. The bar graph on the left shows the seven highest relative frequencies of the HLA-A24:02 allele group in AFND. The map on the right shows a visualization of the allele group frequencies in human populations worldwide. (C) Comparison of HLA-A and HLA-DRA gene expression quantification using scIGD (x-axis) versus kallisto-bustools without using its -mm parameter (y-axis). Raw gene expression counts in the 20k PBMC dataset are shown. HLA gene expression in scIGD is calculated by aggregating the counts of alleles corresponding to each gene. Pearson correlation coefficient is indicated. (D) Assessment of HLA allele specificity in the Merkel cell carcinoma PBMC pre-treatment dataset. The plot displays normalized expressions of different HLA alleles, with alleles typed by arcasHLA [9] in this particular dataset highlighted in black, and those from an unrelated dataset in blue. (E) A heatmap illustrating the sequence similarity percentages of HLA allele sequences from panel (D), calculated with EMBOSS Needle [29].

Validation of the HLA expression quantification

Following allele-typing, the scIGD workflow quantifies expression levels of all genes and HLA alleles using a donor-specific reference built from the allele typing result. The HLA expression at the gene level is then calculated by aggregating the counts of all alleles belonging to the same HLA gene.

We tested whether this allele aggregation approach for gene-level quantification also correlated with the HLA gene expression levels obtained by conventional quantification methods for 10x Genomics (Fig. 3C) and BD Rhapsody single cell sequencing data (Supplementary Fig. S1C). To assess how our HLA gene counts compare to those obtained using kallisto-bustools, where no allele-specific information is provided and gene quantification is based solely on a default reference genome, we compared the raw gene expression counts of HLA-A and HLA-DRA genes (Fig. 3C) as well as HLA-C genes (Supplementary Fig. S1C). The Pearson correlation coefficient was close to 1, indicating that the HLA gene-level expression results derived from scIGD’s aggregated allele counts were highly comparable to those from conventional quantification methods.

Specificity of the HLA allele quantification and technological limitations

In the quantification step of scIGD, we include reads that pseudo-align to multiple transcripts using the -mm option of kallisto-bustools. This is necessary because the HLA alleles in the donor-specific reference files are very similar in sequence and their expression would otherwise be underestimated.

One of the inherent limitations of allele-specific quantification is the high level of sequence similarity between particular HLA alleles. It is expected that the differential quantification accuracy is reduced with increasing sequence similarity of the HLA alleles included in the quantification. In order to illustrate this aspect for our quantification approach, we next examined how scIGD quantified unrelated HLA alleles that were not identified in HLA allele-typing. For this, we quantified expression levels for alleles identified in the HLA typing step in the 20k PBMC dataset (colored in black) alongside unrelated alleles with different levels of sequence similarity to the original alleles (colored in blue) (Fig. 3D). The alleles corresponding to the 20k PBMC dataset generally showed higher expression levels compared to the unrelated alleles, with exceptions observed for unrelated alleles HLA-A02:01* and HLA-C16:01*, which exhibited unexpectedly high expression levels.

We further assessed whether these unexpected expression levels could be attributed to sequence similarity. We aligned all allele sequences used in Fig. 3D using Clustal Omega for multiple sequence alignment [28] (Supplementary File 1). Furthermore, we calculated pairwise sequence similarity percentages with EMBOSS Needle [29] (Fig. 3E). Our analysis suggested that high similarity near the poly-A tail of a transcript has the highest influence on quantification specificity (Supplementary File 1).

A similar validation approach was applied to the amplicon-based Multiple Myeloma dataset [22], containing eight donors. Here, we examined the expression of two distinct HLA-C alleles and the HLA-C gene across all donors (Supplementary Fig. S1D). As described in the methods section, the transcript variants identified in the amplicon-based sequencing typically do not allow identification of reference alleles and are therefore assigned arbitrary allele identifiers. Allele I, identified in donors D3 and D4, showed notably higher expression levels in these two donors than in the others. Likewise, Allele II was primarily expressed in donors D1, D5, and D7. Despite these allele-specific differences, HLA-C gene-level expression remained uniform across all donors.

Biological application and downstream analysis using SCAE

After performing HLA allele-typing and expression quantification with scIGD, we used the resulting quantifications to showcase how our method can be used to explore different immunological questions. For this, we applied the functionality of SCAE and performed downstream analyses on the Merkel-cell carcinoma and 20k PBMC datasets.

SCAE offers a multilayer data framework that allows immune genes to be represented across multiple annotation levels of alleles, genes, and functional classes (Fig. 1C). At the allele level, counts are specific to each individual allele, offering detailed insights into allele-specific expression across cells within one donor. Moving up to the gene level, allele counts are aggregated to reflect the total expression of each HLA gene, giving a broader view of gene expression without focusing on individual alleles and allowing comparisons between donors. Finally, at the functional class level, gene-level counts are further aggregated by functional class, representing overall expression for functional gene groups (e.g. HLA class I and class II) that correspond to specific immune functions. This multilayer design allows users to focus on specific immune gene annotation levels within the same data object during their downstream analyses. SCAE integrates with workflows based on SCE [14], making it compatible with a vast array of existing Bioconductor packages for single-cell analysis.

Detection of HLA loss in single cell sequencing data from solid tumor

We first demonstrate how the multiple layers can be combined to explore loss of HLA expression in tumors. For this, we used the Merkel-cell carcinoma dataset (PBMC and tumor samples, pre- and post-treatment) for which HLA loss has previously been reported [20]. The original study reported that in tumor samples, post-treatment expression of HLA-B was significantly downregulated, suggesting a specific immunologic pressure exerted by the immune system on HLA-B, particularly from the HLA-B35:02*-restricted CD8 + T cells that were transferred as part of the treatment [20].

After preprocessing, we retained 1794, 3439, 2118, and 4896 cells from PBMC pre- and post-treatment and tumor pre- and post-treatment samples, respectively. All four samples were merged into a single SCAE object for integrated analysis. HLA class I expression was highest in the immune cell population of PBMC samples and significantly lower in tumor samples composed mainly by tumor cells (Fig. 4A).

Application of scIGD for biological discovery. (A, B) Results from the Merkel-cell carcinoma dataset, which includes PBMC (“pbmc_”) and tumour (“tumor_”) samples, pre- and post-treatment (“_preRx” and “_postRx”), to explore HLA loss in a solid tumor. (A) Visualization of the log-transformed normalized expression of HLA class I functional gene group using a t-SNE plot. t-SNE was generated from a batch-corrected dimensionality reduction using the MNN integration method and is faceted by sample. Notably, HLA class I expression is high in PBMC samples and significantly lower in tumor samples. (B) Bubble plot representation of HLA class I allele expression across the four samples from the same Merkel-cell carcinoma dataset as in panel (A). The color scale reflects the average of the log-transformed normalized expression values, highlighting differences in allele expression among the samples, while the size of the dots encodes the proportion of cells expressing HLA class I. (C–F) Results from the 20k PBMC dataset to assess HLA allele-specific expression in different cell types and subsets. (C) Cluster annotation on t-SNE. Louvain clustering was performed using the SNN graph constructed from the PCA results. Cluster annotation was achieved using candidate marker genes identified through differential expression testing between pairs of clusters. (D) Log-transformed normalized expression values of HLA-DQB1 alleles in the three monocyte subclusters from the 20k PBMC dataset. (E) Boxplots displaying the distribution of the HLA-DQB102:02:01:01/HLA-DQB106:03:01:01 expression ratio in single cells of the three monocyte subclusters. (F) Heatmap showing the centered mean expression values of the most important marker genes for the three monocyte subclusters.

Our findings align with the original study’s observations, as we also detected a marked reduction in HLA class I gene expression among tumor post-treatment cells, specifically showing a complete loss of both HLA-B alleles (Fig. 4B). Additionally, our allele typing results consistently identified the allele reported in the original study (HLA-B*35:02) across multiple samples, further supporting the reliability of our method for accurate allele identification in single-cell sequencing data. Since we observed downregulation of both HLA-B alleles, the mechanism of HLA downregulation is likely not allele-specific in this particular example. In summary, the SCAE functionality allowed us to explore loss of HLA expression at both the gene and allele levels, supporting the conclusions drawn in the original study.

Identification of differential allele expression in a subset of antigen presenting cells

Next, we demonstrate the use of our method for detection of differential HLA allele expression in different cell types and subsets. For this, we used the 20k PBMC dataset that can also be found in our scaeData R/ExperimentHub package (https://bioconductor.org/packages/scaeData).

After preprocessing, 19765 cells were retained and cell-type annotation was used to identify the distinct immune subsets (Fig. 4C). This included three monocyte subsets, B cells, dendritic cells, NK cells, and two T cell subsets, which were characterized by distinct marker expression profiles (Supplementary Fig. S2).

We further explored the expression of HLA-DQB1 alleles in the three monocyte subsets (Fig. 4D). Monocytes I exhibited the lowest expression levels of HLA-DQB1, whereas Monocytes III had the highest (Fig. 4D). In addition, we observed differential HLA-DQB1 allele expression profiles between the subsets: Monocytes III had a higher ratio of HLA-DQB102:02 / HLA-DQB106:03, expression compared to the other two monocyte subsets (Fig. 4E).

The overall gene expression profiles of the different monocyte populations indeed also revealed functional heterogeneity (Fig. 4F). Monocytes I upregulated classical pro-inflammatory markers, such as S100A8, S100A9, VCAN, FOS, LYZ, and CD14, indicative of a role in acute inflammation. In contrast, Monocytes III showed increased expression of genes associated with HLA class II molecules and other markers like PDE10A and HINT1, suggesting they might be nonclassical, serving a more specialized role in antigen presentation or immune regulation.

These findings emphasize how allele-specific HLA expression can delineate functionally distinct subsets within the same cell type, revealing their distinct biological roles. This level of allele-specific resolution, detecting differential HLA allele expression across subsets of the same cell type, is a significant advantage of our method over traditional transcriptomic workflows, which focus on gene-level expression alone.

Discussion

In this study, we present a comprehensive workflow that enables allele-specific analysis of HLA expression in scRNA-seq data. By leveraging scIGD and SCAE, we demonstrate the utility of our tools across diverse scRNA-seq datasets, including whole-transcriptome 10x Genomics datasets and a multiplexed amplicon-based BD Rhapsody dataset. Our results highlight the ability of this workflow to not only accurately perform HLA allele-typing but also quantify allele-specific expression, offering novel insights into human immunogenomic diversity.

We validated the workflow across multiple datasets and demonstrated the robustness and specificity of the allele typing and quantification. The strong correlation observed between gene-level quantification from our method and standard tools, such as kallisto-bustools [16], further supports the accuracy of our allele-specific quantification approach. Across both whole-transcriptome and amplicon-based datasets, our method was able to reliably detect allele-specific expression differences. Moreover, our methodology reliably recapitulates findings from previous studies such as the loss of HLA-B expression in a Merkel-cell carcinoma case.

Our methodology presents an advancement over existing tools such as seq2HLA [5], OptiType [6], PHLAT [7], arcasHLA [9], and scHLAcount [12] because it (i) integrates both typing and allele-specific quantification within a single workflow, and (ii) offers a data structure optimized for interactive single-cell downstream analysis of different immunogenetic features. Of note, SCAE can be integrated with interactive data exploration tools such as iSEE [30]. One of the key assets of our analysis workflow is the capacity to detect HLA expression differences across subsets of the same immune cell type, particularly within the monocyte subsets in the 20k PBMC dataset. This type of allele-specific expression difference is not detectable with traditional transcriptomic workflows. By capturing allele-specific variations, our method offers a tool for identifying samples where HLA alterations can be detected. These computational predictions typically require careful validation in downstream experimental studies to uncover functional implications of differential HLA allele expression.

Our study highlights the importance of integrating allele-level information into scRNA-seq analyses, particularly for important and structurally diverse immune mediators such as the HLA molecules. Future research could build on this work by extending our workflow to other highly polymorphic loci such as B and T cell receptors, and killer Ig-like receptors. Overall, the scIGD workflow for allele-specific quantification in scRNA-seq offers a powerful tool for exploring the expression of important immune mediators across human donors at very high molecular detail.

Supplementary Material

lqaf149_Supplemental_Files

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Murphy K, Weaver C. Janeway Immunologie. Springer Spektrum Berlin Heidelberg, 2018. 10.1007/978-3-662-56004-4. · doi ↗
2Robinson J, Barker DJ, Marsh SGE. 25 years of the IPD-IMGT/HLA database. HLA. 2024;103:e 15549. 10.1111/tan.15549.38936817 · doi ↗ · pubmed ↗
3Barker DJ, Maccari G, Georgiou X et al. The IPD-IMGT/HLA database. Nucleic Acids Res. 2023;51:D 1053–60. 10.1093/nar/gkac 1011.36350643 PMC 9825470 · doi ↗ · pubmed ↗
4Shohdy KS, Atherton J, Longland J et al. Patient-specific HLA-I subtypes predict response to immune checkpoint blockade. Oncoimmunology. 2025;14:2462386. 10.1080/2162402 X.2025.2462386.39981665 PMC 11849918 · doi ↗ · pubmed ↗
5Boegel S, Löwer M, Schäfer M et al. HLA typing from RNA-seq sequence reads. Genome Med. 2012;4:102. 10.1186/gm 403.23259685 PMC 4064318 · doi ↗ · pubmed ↗
6Szolek A, Schubert B, Mohr C et al. Opti Type: precision HLA typing from next-generation sequencing data. Bioinformatics. 2014;30:3310–6. 10.1093/bioinformatics/btu 548.25143287 PMC 4441069 · doi ↗ · pubmed ↗
7Bai Y, Ni M, Cooper B et al. Inference of high resolution HLA types using genome-wide RNA or DNA sequencing reads. BMC Genomics. 2014;15:1–16. 10.1186/1471-2164-15-325.24884790 PMC 4035057 · doi ↗ · pubmed ↗
8Buchkovich ML, Brown CC, Robasky K et al. HLA Profiler utilizes k-mer profiles to improve HLA calling accuracy for rare and common alleles in RNA-seq data. Genome Med. 2017;9:86. 10.1186/s 13073-017-0473-6.28954626 PMC 5618726 · doi ↗ · pubmed ↗