AntigenLM: Structure-Aware DNA Language Modeling for Influenza

Yue Pei; Xuebin Chi; Yu Kang

arXiv:2602.09067 · q-bio.GN · February 12, 2026


Open Access 3 Reviews

TL;DR

AntigenLM is a structure-aware DNA language model pretrained on influenza genomes. It forecasts antigenic evolution and classifies subtypes, outperforming phylogenetic and evolution-based models by preserving functional-unit integrity during pretraining.

Contribution

This work introduces AntigenLM, a novel structure-aware DNA language model that captures evolutionary constraints and improves antigenic variant prediction and classification.

Findings

01

AntigenLM accurately forecasts future influenza antigenic variants.

02

Disrupting genomic structure through fragmentation or shuffling severely degrades model performance.

03

AntigenLM outperforms phylogenetic and evolution-based models.

Abstract

Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language…
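The contrast between intact, fragmented, and shuffled inputs described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy sequences, the segment names used as dictionary keys, the window size, and the helper names `intact_input`, `fragmented_input`, and `shuffled_input` are hypothetical, and the paper's actual ablation procedure may differ (for example, shuffling may be applied at a different granularity).

```python
import random

# Hypothetical toy genome: segment name -> nucleotide sequence (not real influenza data).
TOY_GENOME = {
    "HA": "ATGAAGGCAATACTAGTAG",
    "NA": "ATGAATCCAAACCAAAAG",
    "M": "ATGAGTCTTCTAACCGAG",
}


def intact_input(genome, order):
    """Concatenate whole segments in a fixed order, keeping each unit contiguous."""
    return "".join(genome[name] for name in order)


def fragmented_input(genome, order, window, rng):
    """Cut the concatenated genome into fixed-size windows and return one at random.

    Window boundaries ignore segment boundaries, so a functional unit may be split.
    """
    full = intact_input(genome, order)
    chunks = [full[i:i + window] for i in range(0, len(full), window)]
    return rng.choice(chunks)


def shuffled_input(genome, order, rng):
    """Concatenate the same whole segments in a random order (assumed form of shuffling)."""
    permuted = list(order)
    rng.shuffle(permuted)
    return "".join(genome[name] for name in permuted)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    order = ["HA", "NA", "M"]
    print("intact:    ", intact_input(TOY_GENOME, order))
    print("fragmented:", fragmented_input(TOY_GENOME, order, window=10, rng=rng))
    print("shuffled:  ", shuffled_input(TOY_GENOME, order, rng=rng))
```

In this sketch the intact variant always presents each segment whole and in the same position, while the other two variants break either contiguity or ordering; this is the kind of structural disruption the ablation is reported to penalize.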

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. It is a novel idea to use a DNA language model for predicting viral evolution, which is an important problem for public health.
2. There are very few DNA language models for eukaryotic viruses, and this work has filled this gap.
3. The training data is carefully curated and documented in detail. It could be a useful resource to the community.
4. The evaluation tasks are well-designed, and data splitting was done thoughtfully.

Weaknesses

1. I am not sure if the authors covered all the appropriate baselines. I am not super familiar with the field, but there seem to be many works on using protein language models to predict viral evolution. For example, https://www.science.org/doi/10.1126/science.abd7331, https://www.nature.com/articles/s41392-024-02066-x, https://www.biorxiv.org/content/10.1101/2025.08.04.668423v1. Although this model is trained at the DNA level, the evaluations seem mostly at the amino acid level. The authors sho

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The idea of preserving full functional units during pretraining feels biologically meaningful and easy to justify. It makes sense that genomes should be learned as whole systems, not chopped-up pieces. The model is compact and computationally reasonable, which makes the method more accessible to other labs. The overall contribution to treating DNA as structured language is simple but potentially important for future genomic foundation models.

Weaknesses

1. The paper claims a big jump over existing models, but I’m not fully convinced it’s a fair comparison. The baselines feel a bit old; why not include newer genomic LMs like HyenaDNA or TITAN? That would make the improvement more believable.
2. Figure 1 is visually appealing and central to understanding the paper, but it could better communicate data imbalance, temporal coverage, and architectural specifics. A supplementary figure showing training sample counts per region/year and a more expli

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper presents a sound way of collecting and preprocessing the influenza genomes, taking care of gene-segment alignment and functional-unit preservation, which led to efficient autoregressive model pretraining. Additionally, it shows that overlooking alignment and functional-unit preservation can lead to performance degradation.
- The proposed model improved influenza forecasting over evolutionary models and tree-based predictors. The model achieved lower genetic mismatches b

Weaknesses

- The presentation of the paper requires improvement. The overall writing style and, in particular, the clarity of explanations are of low quality. While the pretraining procedure of the model is well described, the fine-tuning section is poorly explained. The fine-tuning process for each specific task should be described in greater detail. I recommend illustrating the fine-tuning procedures with clear indications of the prompts and the regions being forecasted. The current Figure 1B does not co


Taxonomy

Topics: Vaccines and immunoinformatics approaches · Influenza Virus Research Studies · Machine Learning in Bioinformatics