Clustering-based identification of immune-related gene signatures in hepatocellular carcinoma
Jyoti Brahmaiah, Usha Adiga, Alfred J Augustine, Sampara Vasishta

TL;DR
This study identifies immune-related gene signatures in liver cancer using clustering methods, highlighting their role in tumor progression and potential for immunotherapy.
Contribution
The novel use of multiple clustering techniques to uncover coordinated immune gene clusters in hepatocellular carcinoma.
Findings
MHC class II genes formed a distinct cluster using K-means clustering.
MCL and DBSCAN revealed unified clusters involving both MHC class I and II molecules.
CD4, CD74, and HLA-DQA1 were identified as central nodes in immune gene regulatory networks.
Abstract
Hepatocellular carcinoma (HCC) is a complex malignancy influenced by genetic, epigenetic and immune-related factors. The tumour immune microenvironment plays a critical role in HCC progression and response to immunotherapy. Identifying key immune-related gene signatures through clustering techniques can provide insights into tumour biology and therapeutic targets. We employed K-means, Markov Clustering Algorithm (MCL) and density-based spatial clustering of applications with noise (DBSCAN) to analyse immune-related genes in HCC. Functional enrichment analysis was conducted using Gene Ontology (GO) biological process, cellular component and molecular function categories, along with pathway analysis from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome databases. Additionally, protein–protein interaction (PPI) hub analysis and microRNA (miRNA) target predictions were…
Taxonomy
Topics: Liver Disease Diagnosis and Treatment · Bioinformatics and Genomic Networks · Machine Learning in Bioinformatics
Introduction
Hepatocellular carcinoma (HCC) is the most prevalent form of primary liver cancer and represents a significant global health burden. It is strongly associated with chronic liver diseases, including hepatitis B virus (HBV) and hepatitis C virus (HCV) infections, metabolic dysfunction-associated steatotic liver disease (MASLD) and alcohol-related liver disease [1]. Although the present incidence (2.15 per 100,000), prevalence (2.27 per 100,000) and mortality (2.21 per 100,000) rates of HCC in India are lower than the global figures, the yearly rates of change in these measures are greater in India [2].
Epidemiology and risk factors
The incidence of HCC varies globally, with higher rates observed in regions with endemic HBV and HCV infections, such as East Asia and sub-Saharan Africa [3]. Chronic HBV infection remains the leading cause of HCC, contributing to approximately 50% of cases worldwide, while chronic HCV infection accounts for 25%–30% of cases [4]. In recent years, MASLD has emerged as a critical risk factor, particularly in Western countries, where obesity and metabolic syndrome are prevalent [5]. Other risk factors include excessive alcohol consumption, aflatoxin exposure, diabetes and genetic predisposition [6].
Molecular pathogenesis of HCC
HCC arises through a multistep process involving genetic and epigenetic alterations that drive malignant transformation. Key molecular pathways implicated in HCC development include Wnt/β-catenin signaling, TP53 mutations, oxidative stress and inflammation-mediated hepatocarcinogenesis [7]. The integration of HBV DNA into the host genome is a well-established oncogenic driver that promotes genomic instability and tumour progression [8]. Additionally, mutations in genes such as TERT, CTNNB1 and TP53 are frequently observed in HCC tumours, further highlighting the heterogeneous nature of the disease [9].
Role of genetic variants in HCC development
Genome-wide association studies have identified several genetic variants associated with increased susceptibility to HCC. The PNPLA3 rs738409 C>G polymorphism has been strongly linked to MASLD-associated HCC, with studies demonstrating that carriers of the G allele exhibit increased hepatic fat accumulation and fibrosis progression [10]. Similarly, TM6SF2 rs58542926 has been associated with hepatic steatosis and an elevated risk of fibrosis, contributing to HCC development in patients with MASLD [11]. Other significant variants, such as MBOAT7 rs641738 and HSD17B13 rs72613567, have been implicated in modifying liver disease progression and influencing HCC susceptibility [12,13].
Objectives
To analyse the biological processes (BPs), cellular components (CCs) and molecular functions (MFs) of clustered immune genes, focusing on antigen presentation, major histocompatibility complex (MHC) class I and II interactions and immune response pathways in HCC.
To explore key immune pathways enriched in HCC, including PD-1 signaling, antigen processing and presentation, interferon gamma signaling and glutamate receptor signaling, to assess their role in immune evasion and tumour progression.
To investigate the involvement of microRNAs (miRNAs) in HCC immune regulation and identify specific miRNAs that target immune-related genes and contribute to immune modulation in the tumour microenvironment.
To identify protein–protein interaction (PPI) hub proteins linked to immune regulation in HCC, highlighting key molecules that may serve as central regulators of immune response and potential therapeutic targets.
To compare the clustering effectiveness of K-means, Markov Clustering Algorithm (MCL) and density-based spatial clustering of applications with noise (DBSCAN) in classifying immune-related genes in HCC, and evaluate the similarities and differences between these approaches in capturing biologically relevant gene interactions.
Methodology
Study design and data acquisition
This study employed a bioinformatics-driven approach to analyse HCC data, focusing on gene expression, pathway enrichment and protein interaction networks. The dataset was obtained from publicly available genomic repositories [14], including the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). The selection criteria included datasets with well-defined sample classifications, including HCC tumour tissues and adjacent normal liver tissues. The inclusion of high-throughput sequencing and microarray data allowed for a comprehensive assessment of gene expression variations, molecular interactions and pathway deregulations.
Differential gene expression analysis
Differential expression analysis was conducted using DESeq2 for RNA-Seq data and limma for microarray data. Genes with an adjusted p-value (false discovery rate, FDR) of <0.05 and a log2 fold change of >1 or <−1 were considered significantly differentially expressed. The lists of upregulated and downregulated genes were further validated by cross-referencing with existing literature on HCC.
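The thresholds above can be illustrated with a minimal Python sketch. The gene names, field names and values below are hypothetical placeholders rather than results from the study; the sketch only assumes that each gene already has an FDR-adjusted p-value and a log2 fold change from DESeq2 or limma.

```python
# Hypothetical differential-expression results; names and numbers are illustrative only.
results = [
    {"gene": "GENE_A", "log2fc": 2.3, "padj": 0.001},
    {"gene": "GENE_B", "log2fc": -1.7, "padj": 0.020},
    {"gene": "GENE_C", "log2fc": 0.4, "padj": 0.003},
    {"gene": "GENE_D", "log2fc": 3.1, "padj": 0.200},
]


def classify(row, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Label one gene using the cut-offs stated in the text."""
    if row["padj"] >= padj_cutoff:
        return "ns"      # fails the FDR threshold
    if row["log2fc"] > lfc_cutoff:
        return "up"      # significantly upregulated
    if row["log2fc"] < -lfc_cutoff:
        return "down"    # significantly downregulated
    return "ns"          # significant, but the effect size is too small


labels = {row["gene"]: classify(row) for row in results}
print(labels)
```

Both conditions must hold: GENE_C passes the FDR cut-off but not the fold-change cut-off, and GENE_D shows the reverse, so neither is retained.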
Functional enrichment analysis
Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed using clusterProfiler in R. The GO analysis categorised the significantly differentially expressed genes into three domains: BP, CC and MF. The KEGG analysis identified significantly enriched pathways related to oncogenesis, immune response and metabolic dysregulation in HCC.
For immune-related processes, GO terms such as ‘Antigen Processing and Presentation via MHC Class II’ were analysed to determine the role of HLA-DQA1 in tumour immune evasion. Additionally, pathways related to neurotransmitter signaling, such as ‘Glutamate Receptor Signaling’, were assessed for their involvement in HCC progression.
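Over-representation analysis of this kind is commonly based on a hypergeometric (one-sided Fisher) test. The following Python sketch shows that calculation for a single term. The counts are invented for illustration, the function name is hypothetical, and the sketch is not a reproduction of the exact procedure or background set used in the study.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of at least k term genes in a list of n genes,
    drawn from a background of N genes of which K are annotated to the term."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Illustrative counts only (assumptions, not values from the study):
# 20,000 background genes, 15 annotated to the term,
# 50 genes in the input list, 4 of which carry the annotation.
p_value = hypergeom_sf(k=4, N=20000, K=15, n=50)
print(f"{p_value:.3g}")
```

In a real analysis one such p-value is computed per term and the whole set is then corrected for multiple testing.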
PPI network construction
To identify key hub proteins and their potential role in hepatocarcinogenesis, a PPI network was constructed using STRING and visualised using Cytoscape. The MCODE algorithm was applied to detect highly interconnected clusters of proteins. Key hub proteins such as FLNA, replication timing regulatory factor 1 (RIF1) and DLG4 were identified based on their high degree of connectivity.
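Ranking proteins by degree, as described above, amounts to counting the edges incident to each node. The sketch below uses a hypothetical edge list with placeholder node names (P1 to P5); it does not represent the STRING network analysed in the study.

```python
from collections import Counter

# Hypothetical undirected interaction list; node names are placeholders.
edges = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P1", "P4"),
    ("P2", "P3"),
    ("P4", "P5"),
]

# Degree = number of edges touching each node.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Sort by descending degree, breaking ties alphabetically for reproducibility.
ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
top_hub = ranked[0][0]
print(ranked)
```

Degree is only one centrality measure; a hub defined this way depends on how complete the underlying interaction database is.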
miRNA target prediction and regulatory network analysis
To assess post-transcriptional regulation in HCC, miRNA enrichment analysis was conducted using miRTarBase 2017 and TargetScan databases. Significant interactions were identified between HLA-DQA1 and miRNAs such as hsa-miR-6798-3p and hsa-miR-4645-5p, which have been implicated in immune modulation. Similarly, hsa-miR-122-5p, a liver-specific miRNA, was examined for its role in tumour progression.
The regulatory network was constructed by integrating mRNA, miRNA and PPI interactions, providing insights into how miRNAs influence oncogenic pathways in HCC.
Statistical analysis
All statistical analyses were performed using R (version 4.1.2) and Python (version 3.9). A p-value of <0.05 was considered statistically significant. Multiple testing correction was applied using the Benjamini–Hochberg (BH) method to control for false discovery rates.
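For readers unfamiliar with the BH procedure, a minimal pure-Python sketch is given here. The input p-values are illustrative and the function name is hypothetical; in practice the adjustment would normally come from an established routine such as `p.adjust(method = "BH")` in R.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]  # illustrative raw p-values
adj = benjamini_hochberg(pvals)
print(adj)
```

Each raw p-value is scaled by m/rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a larger one.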
In this study, we applied three clustering techniques—K-means, MCL and DBSCAN—to classify immune-related genes associated with HCC.
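Of the three methods, MCL is probably the least familiar, so a minimal sketch of its core loop is given here. The text does not state what input the algorithms were run on; this sketch assumes a small unweighted interaction graph supplied as an adjacency matrix, and the inflation value, iteration count and toy graph are illustrative assumptions rather than the study's settings.

```python
def mcl(adjacency, inflation=2.0, iterations=50):
    """Minimal Markov clustering on a square 0/1 adjacency matrix (list of lists)."""
    n = len(adjacency)

    def normalise(matrix):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            col_sum = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                matrix[i][j] /= col_sum
        return matrix

    # Add self-loops so every column has a non-zero sum, then normalise.
    m = normalise([[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])

    for _ in range(iterations):
        # Expansion: square the matrix (simulates longer random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalise (sharpens strong flows).
        m = normalise([[value ** inflation for value in row] for row in m])

    # Each row with non-zero entries marks an attractor; its non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy graph (assumption): two disconnected triangles, nodes 0-2 and 3-5.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
found = mcl(toy)
print(sorted(sorted(c) for c in found))
```

Unlike K-means, MCL works on graph connectivity and does not need the number of clusters in advance; the inflation parameter controls granularity instead. This difference in input and assumptions is one plausible reason why graph-based and centroid-based methods can group the same genes differently.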
Ethical considerations
This is not a clinical case report but a bioinformatics-based analysis using publicly available datasets from repositories. No direct patient recruitment, intervention or identifiable clinical data were involved; therefore, informed consent and institutional ethical approval were not required. All datasets used were generated under the ethical guidelines of their respective repositories.
Results
GO BP analysis
GO enrichment analysis identified several BPs significantly associated with the dataset. The highest combined scores were observed for MHC Class II protein complex assembly and peptide antigen assembly with MHC Class II protein complex, both showing strong enrichment and relatively high overlap ratios. Other notable processes included peptide antigen assembly with MHC protein complex and immunoglobulin production involved in the immunoglobulin-mediated immune response, indicating a clear link to antigen presentation and immune activation pathways. Processes related to antigen processing and presentation of exogenous peptide antigen via MHC Class II, synaptic transmission (glutamatergic) and glutamate receptor signaling pathway were also enriched, though with comparatively lower combined scores and overlap ratios. Overall, the enrichment profile highlights a predominance of immune-related pathways, particularly those linked to MHC Class II-mediated antigen processing and presentation. This suggests that the dataset is functionally enriched for processes central to adaptive immune responses (Figure 1).
GO CC analysis
The enrichment analysis for CCs indicated the strongest associations with MHC Class II protein complex and MHC protein complex, both showing the highest combined scores and overlap ratios. Additional enriched components included the lumenal side of the endoplasmic reticulum membrane and the ionotropic glutamate receptor complex, suggesting a combination of immune-related and neuronal membrane-associated elements.
Other terms with notable enrichment were postsynaptic density membrane and postsynaptic specialisation membrane, reflecting potential involvement of synaptic structures. Lower combined score terms such as coated vesicle membrane, ER to Golgi transport vesicle membrane, transport vesicle membrane and clathrin-coated endocytic vesicle membrane indicate processes linked to intracellular transport and vesicular trafficking. The dataset is enriched for membrane-associated complexes, with predominant representation of MHC-related immune structures and secondary enrichment in synaptic and vesicular transport components (Figure 2).
GO MF analysis
The MF analysis identified MHC Class II receptor activity as the most enriched term, with the highest combined score and overlap ratio. This was followed by ionotropic glutamate receptor activity and neurotransmitter receptor activity involved in the regulation of postsynaptic membrane potential, indicating functional links to both immune recognition and neuronal signaling. Other enriched terms included MHC Class II protein complex binding and transmitter-gated monoatomic ion channel activity involved in the regulation of postsynaptic membrane potential, further highlighting the interplay between antigen presentation and synaptic signaling. Lower combined score functions, such as sodium channel activity, ligand-gated monoatomic cation channel activity and potassium channel activity, point toward additional ion transport processes. The dataset is enriched for functions related to MHC Class II-mediated antigen recognition and binding, with secondary enrichment in neurotransmitter receptor activity and ion channel regulation (Figure 3).
KEGG pathway analysis
KEGG pathway analysis revealed significant enrichment in immune-related diseases, with the highest combined score observed for asthma, followed by allograft rejection and graft-versus-host disease. Other notable enriched pathways included type I diabetes mellitus, intestinal immune network for IgA production and autoimmune thyroid disease, highlighting strong associations with autoimmune and hypersensitivity conditions. Lower-ranked enriched terms comprised viral myocarditis, inflammatory bowel disease, leishmaniasis and antigen processing and presentation, indicating a broader involvement of the dataset in infectious and inflammatory immune pathways. The dataset shows predominant enrichment for KEGG pathways related to autoimmune, inflammatory and immune-mediated diseases, suggesting a central role in immune dysregulation and pathogen response (Figure 4).
miRTarBase 2017 analysis
The enriched miRNAs are represented in Table 1.
PPI hub proteins
PPI network analysis identified key hub proteins, with FLNA, RIF1 and DLG4 showing significant interactions with GRIK1 and HLA-DQA1. FLNA (Filamin A) plays a role in cytoskeletal organisation, while RIF1 is involved in DNA repair. These findings, represented in Table 2, suggest that alterations in these proteins could contribute to tumour progression and immune evasion in HCC. Tables 3–5 present the results of the K-means, MCL and DBSCAN clustering, respectively.
This bioinformatics analysis of HCC data provides a comprehensive understanding of the immune and neurological pathways implicated in disease progression. The enrichment of HLA-DQA1 in antigen processing pathways underscores its role in immune modulation, while GRIK1-associated neurotransmitter signaling highlights a potential novel mechanism in hepatocarcinogenesis. The identified miRNAs and hub proteins suggest additional regulatory layers influencing HCC pathophysiology. These findings provide valuable insights for future research, particularly in developing immunotherapy strategies and targeted treatments for HCC.
Discussion
Our analysis of HCC using various bioinformatics tools provides critical insights into the immune-related molecular mechanisms contributing to disease progression. The GO terms, KEGG pathway analysis, PPI hub proteins and miRNA regulatory network collectively highlight key immune system components involved in HCC pathogenesis.
The KEGG pathway results further support the notion that autoimmune and inflammatory pathways are closely tied to HCC development. Additionally, miRNA regulation of key immune genes may serve as a novel mechanism for immune evasion, offering potential therapeutic targets. The clustering analysis revealed three major insights:
Class II Predominance in K-means: The K-means approach isolated MHC class II genes into a single cluster, indicating their distinct expression pattern in HCC. This highlights their role in adaptive immunity and tumour immune escape.
Integration of class I and II in MCL and DBSCAN: Unlike K-means, both MCL and DBSCAN clustered MHC class I and II genes together. This suggests functional cross-talk between these molecules in shaping the tumour immune microenvironment (TIME).
CD4 and CD74 as key regulators: The consistent clustering of CD4 and CD74 with MHC genes underscores their pivotal role in antigen presentation and T-cell activation in HCC.
Differential clustering of MHC class I and II genes: The clustering methods employed demonstrated distinct grouping tendencies of MHC genes. K-means clustering separated MHC class II molecules into a single group, while MCL and DBSCAN clustered both class I and II genes together. This distinction suggests that while class II molecules may exhibit distinct transcriptional or post-transcriptional regulation in HCC, their interaction with class I molecules remains essential in immune recognition and tumour evasion. The presence of class I molecules such as HLA-B within the same cluster as class II genes in MCL and DBSCAN points to functional convergence in antigen processing and immune activation.
CD4 and CD74 as central players in immune modulation: The consistent clustering of CD4 and CD74 with MHC class II genes underscores their critical role in antigen presentation and T-cell activation. CD4, a key co-receptor for MHC class II molecules, is central to T-helper cell function, supporting immune surveillance and cytokine-mediated responses. Similarly, CD74, which acts as an MHC class II chaperone, facilitates antigen processing and presentation. Their co-clustering with HLA genes reinforces their role in orchestrating the immune response against HCC cells, potentially influencing patient prognosis and treatment response.
Potential implications for the tumour immune microenvironment: The clustering of immune-related genes reveals patterns that may reflect the tumour microenvironment's immunological landscape. The integration of MHC class I and II genes, as seen in MCL and DBSCAN clustering, indicates a coordinated immune response that may be influenced by tumour-induced immune suppression. The immune escape mechanisms in HCC often involve the downregulation of MHC molecules, allowing tumour cells to evade cytotoxic T-cell recognition. Understanding these clustering patterns could help identify novel biomarkers and therapeutic targets to counteract immune evasion strategies in HCC.
Comparative effectiveness of clustering techniques: The similarities between MCL and DBSCAN suggest these methods are robust for detecting biologically relevant interactions among immune genes. Both approaches grouped class I and II molecules together, capturing the functional interplay between these components. In contrast, K-means identified a distinct cluster for class II molecules, possibly due to differences in transcriptional regulation or specific cellular pathways unique to antigen-presenting cells. These variations highlight the importance of method selection when analysing immune-related gene expression in cancer research.
The TIME plays a crucial role in HCC progression and response to therapy. HCC tumours exhibit immune suppression through multiple mechanisms, including the upregulation of immune checkpoint molecules such as PD-1, CTLA-4 and TIM-3, which contribute to T-cell exhaustion [15]. Additionally, alterations in antigen presentation pathways, such as downregulation of MHC class I and II molecules, allow tumour cells to evade immune surveillance [16]. Understanding the immune landscape of HCC is essential for the development of immunotherapeutic strategies.
Recent advancements in computational biology have enabled the clustering of immune-related genes in HCC, shedding light on distinct immune subtypes with potential therapeutic implications. Clustering approaches such as K-means, MCL and DBSCAN have been utilised to classify immune-related genes based on their expression profiles [17]. These techniques have identified key immune clusters involving MHC class I and II molecules, CD4 and CD74, which are integral to antigen presentation and T-cell activation [18].
The identification of immune-related gene clusters has significant implications for personalised therapy in HCC. Immunotherapy, particularly immune checkpoint inhibitors (ICIs) targeting PD-1 and CTLA-4, has revolutionised HCC treatment [19]. The combination of ICIs, such as tremelimumab with durvalumab, has demonstrated promising efficacy in patients with unresectable HCC [20]. Additionally, therapies targeting the Wnt/β-catenin pathway and tumour-associated macrophages are being explored as potential strategies to enhance antitumour immunity [21].
Limitations of the study
The findings presented in this study are derived entirely from in silico analyses, which, while powerful for identifying potential biological associations, are inherently hypothesis-generating in nature. As such, these results should be interpreted with caution, as they require experimental validation to confirm their biological relevance. Further studies using cellular and rodent models are essential to elucidate the underlying mechanisms and to assess causality. Moreover, validation in appropriate clinical cohorts will be necessary to determine the translational significance of these findings in human disease contexts.
Conclusion
This study underscores the intricate interplay between the immune system and HCC, with a strong emphasis on antigen presentation, immune evasion and neuroimmune interactions. The identification of key regulatory miRNAs and hub proteins provides promising avenues for further research into targeted therapies. Future investigations should focus on validating these findings through experimental and clinical studies to improve our understanding of HCC pathogenesis and treatment strategies.
Our study demonstrates the utility of clustering algorithms in identifying immune gene interactions in HCC. The integration of MHC class I and II molecules in MCL and DBSCAN suggests coordinated immune regulation, while K-means highlights distinct expression patterns. These insights contribute to understanding immune modulation in HCC and may guide future immunotherapeutic strategies.
Conflicts of interest
The authors declare no conflicts of interest related to this work.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial or not-for-profit sectors.
Author contributions
Jyoti Brahmaiah – conceptualisation, data curation, formal analysis, original draft preparation; Usha Adiga – statistical analysis, methodology refinement, project administration, critical manuscript revision; Alfred J Augustine – supervision, validation, interpretation of results, critical feedback; Sampara Vasishta – literature review, bioinformatics analysis, visualisation, writing (review and editing), correspondence. All authors reviewed and approved the final manuscript.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of the article, the authors used AI tools to reformulate some sentences, after which the authors reviewed and edited the content as needed and take full responsibility for the content of the article.
References
- 1. Abdelhamed W, El-Kassas M. Hepatitis B virus as a risk factor for hepatocellular carcinoma: there is still much work to do. Liver Res. 2024;8(2):83–90. doi:10.1016/j.livres.2024.05.004. PMID: 39959873. PMCID: PMC11771266.
- 2. Giri S, Singh A. Epidemiology of hepatocellular carcinoma in India - an updated review for 2024. J Clin Exp Hepatol. 2024;14(6):101447. doi:10.1016/j.jceh.2024.101447. PMID: 38957612. PMCID: PMC11215952.
- 3. Trépo E, Nahon P, Bontempi G. Association between the PNPLA3 (rs738409 C>G) variant and hepatocellular carcinoma: evidence from a meta-analysis of individual participant data. Hepatology. 2014;59:2170–2177. doi:10.1002/hep.26767. PMID: 24114809.
- 4. Burza MA, Pirazzi C, Maglio C. PNPLA3 I148M (rs738409) genetic variant is associated with hepatocellular carcinoma in obese individuals. Dig Liver Dis. 2012;44:1037–1041. doi:10.1016/j.dld.2012.05.006. PMID: 22704398.
- 5. Liu YL, Patman GL, Leathart JBS. Carriage of the PNPLA3 rs738409 C>G polymorphism confers an increased risk of non-alcoholic fatty liver disease-associated hepatocellular carcinoma. J Hepatol. 2014;61:75–81. doi:10.1016/j.jhep.2014.02.030. PMID: 24607626.
- 6. Seko Y, Sumida Y, Tanaka S. Development of hepatocellular carcinoma in Japanese patients with biopsy-proven non-alcoholic fatty liver disease: association between PNPLA3 genotype and hepatocarcinogenesis/fibrosis progression. Hepatol Res. 2017;47:1083–1092. doi:10.1111/hepr.12840. PMID: 27862719.
- 7. Newberry EP, Hall Z, Xie Y. Liver-specific deletion of mouse Tm6sf2 promotes steatosis, fibrosis, and hepatocellular cancer. Hepatology. 2021;74:1203–1219. doi:10.1002/hep.31771. PMID: 33638902. PMCID: PMC8390580.
- 8. Liu YL, Reeves HL, Burt AD. TM6SF2 rs58542926 influences hepatic fibrosis progression in patients with non-alcoholic fatty liver disease. Nat Commun. 2014;5:4309. doi:10.1038/ncomms5309. PMID: 24978903. PMCID: PMC4279183.
