# DFL-MHC: MHC identification model based on dual-stage training and multi-view feature fusion

**Authors:** Yanjuan Li, Yiben Lin, Dong Chen

PMC · DOI: 10.3389/fgene.2026.1774569 · Frontiers in Genetics · 2026-01-21

## TL;DR

DFL-MHC identifies MHC proteins by fusing features that two protein language models extract from two truncated views of each sequence, then classifying the fused features with an attention-based BiLSTM.

## Contribution

DFL-MHC introduces a dual-stage training and multi-view feature fusion framework for improved MHC identification.

## Key findings

- DFL-MHC is reported to achieve better performance than existing methods on the MHC identification task.
- The model captures complementary information across sequence lengths and different PLMs.
- A BiLSTM with an attention mechanism models long-range and deep semantic dependencies within sequences.
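
To make the attention step in the last finding concrete, here is a minimal sketch of attention pooling over a sequence of hidden states. It assumes the BiLSTM outputs are already available as a list of vectors, and it scores each time step by a dot product with a query vector. Both the scoring function and the names `attention_pool`, `hidden`, and `query` are illustrative assumptions; this summary does not specify the attention form used in the paper.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Collapse T hidden states (each of size D) into one context vector.

    Assumption: dot-product scoring against a single query vector, which
    would be a learned parameter in a trained model.
    """
    # One scalar score per time step.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    # Weighted sum of hidden states, computed per dimension.
    context = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return context, weights


# Stand-in for BiLSTM outputs: three time steps, two dimensions each.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = attention_pool(hidden_states, query=[2.0, 0.0])
```

The weights sum to one, so time steps whose hidden state aligns with the query dominate the pooled representation that a classifier head would consume.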

## Abstract

The major histocompatibility complex (MHC) is the central genetic basis of adaptive immune responses and plays a crucial role in antigen presentation, immune surveillance, and susceptibility to various diseases. Accurate MHC identification is therefore essential for both immunological research and clinical applications. Most existing methods still depend on manually engineered features or a single protein language model (PLM), so they cannot fully capture complementary information across sequence lengths or across different PLMs. Furthermore, most existing methods build the identification model with conventional machine learning algorithms or simple multilayer perceptron (MLP) classifiers, which cannot model deep semantic dependencies within sequences. To overcome these limitations, we introduce DFL-MHC, an MHC identification model based on dual-stage training and multi-view feature fusion that unifies multi-sequence and multi-model views within a dual-stage training strategy. In the feature extraction stage, we design a cross-sequence and cross-model multi-view scheme: a protein sequence is truncated into two different residue sequences, each of length 1,022; two PLMs are employed, respectively, to extract features from the two residue sequences; and the extracted features are fused to represent the protein sequence. A dimensionality reduction algorithm is then applied to the fused features to obtain the optimal feature subset, which can fully capture complementary information across sequence lengths and across different PLMs. In the feature modeling stage, we construct a bi-directional long short-term memory (BiLSTM) network incorporating an attention mechanism to capture long-range dependencies and deep semantic dependencies within sequences. On the MHC identification task, DFL-MHC achieves better performance than existing methods. These results demonstrate the effectiveness of combining multi-view feature fusion and dual-stage training for accurate and reliable MHC identification.
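
The feature extraction stage described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two 1,022-residue windows are chosen, so the sketch takes the N-terminal and C-terminal windows; it does not name the two PLMs, so `embed_composition` and `embed_summary` are hypothetical toy stand-ins; and it pairs view 1 with the first embedder and view 2 with the second, following "respectively" literally. The dimensionality reduction step is omitted because the algorithm is not named here.

```python
from typing import Callable, List, Sequence, Tuple

MAX_LEN = 1022  # window length stated in the abstract
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWY")

Embedder = Callable[[str], List[float]]


def two_views(seq: str, max_len: int = MAX_LEN) -> Tuple[str, str]:
    """Return two truncated views of a protein sequence.

    Assumption: N-terminal and C-terminal windows. Sequences no longer
    than max_len yield two identical views.
    """
    return seq[:max_len], seq[-max_len:]


def embed_composition(view: str) -> List[float]:
    """Hypothetical stand-in for PLM 1: 20-dim amino acid composition."""
    n = max(len(view), 1)
    return [view.count(a) / n for a in AMINO_ACIDS]


def embed_summary(view: str) -> List[float]:
    """Hypothetical stand-in for PLM 2: relative length, hydrophobic fraction."""
    n = max(len(view), 1)
    hydrophobic = sum(1 for r in view if r in HYDROPHOBIC)
    return [len(view) / MAX_LEN, hydrophobic / n]


def fuse_views(
    seq: str,
    embedders: Sequence[Embedder] = (embed_composition, embed_summary),
) -> List[float]:
    """Embed each view with its paired embedder and concatenate the results."""
    fused: List[float] = []
    for view, embed in zip(two_views(seq), embedders):
        fused.extend(embed(view))
    return fused


# Toy sequence longer than the window, so the two views differ.
example = "A" * 1000 + "C" * 500
features = fuse_views(example)
```

Concatenation is the simplest fusion choice; in a real pipeline each stand-in would be replaced by a pooled PLM embedding, and the fused vector would then pass through the dimensionality reduction step.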

## Linked entities

- **Proteins:** HLA-C (major histocompatibility complex, class I, C)

## Full-text entities

- **Genes:** HLA-C (major histocompatibility complex, class I, C) [NCBI Gene 3107] {aka D6S204, HLA-JY3, HLAC, HLC-C, MHC, PSORS1}

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12868131/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12868131/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/PMC12868131/full.md

---
Source: https://tomesphere.com/paper/PMC12868131