AntigenLM: Structure-Aware DNA Language Modeling for Influenza
Yue Pei, Xuebin Chi, Yu Kang

TL;DR
AntigenLM is a structure-aware DNA language model pretrained on influenza genomes with intact, aligned functional units. It forecasts future antigenic variants and classifies subtypes, outperforming phylogenetic and evolution-based models.
Contribution
This work introduces AntigenLM, a novel structure-aware DNA language model that captures evolutionary constraints and improves antigenic variant prediction and classification.
Findings
AntigenLM accurately forecasts future influenza antigenic variants.
Disrupting genomic structure through fragmentation or shuffling severely degrades model performance.
AntigenLM outperforms phylogenetic and evolution-based models.
Abstract
Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language…
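The two ablation perturbations named in the abstract, fragmentation and shuffling, can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the function names, the fixed fragment length, the choice to shuffle whole units rather than nucleotides, and the toy sequences are all assumptions made for illustration.

```python
import random


def fragment(units, frag_len):
    """Cut the concatenated genome into fixed-length pieces.

    Assumption: fragmentation ignores functional-unit boundaries, so a
    piece may span the end of one unit and the start of the next.
    """
    genome = "".join(units)
    return [genome[i:i + frag_len] for i in range(0, len(genome), frag_len)]


def shuffle_units(units, seed=0):
    """Return the units in a random order, leaving each unit intact.

    Assumption: shuffling permutes whole units; the input list is not
    modified.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return shuffled


# Toy stand-ins for aligned functional units (hypothetical, not real data).
units = ["ATGAAGGCA", "ATGAATCCAAAT", "ATGGAG"]

intact = "".join(units)               # structure-preserving input
fragments = fragment(units, 8)        # boundaries no longer respected
reordered = shuffle_units(units, 1)   # unit order no longer preserved

print(intact)
print(fragments)
print(reordered)
```

In this sketch both perturbations keep the nucleotide content unchanged and alter only how it is arranged, which is the kind of contrast a structure ablation is meant to isolate.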
Peer Reviews
Decision: ICLR 2026 Poster
1. It is a novel idea to use a DNA language model for predicting viral evolution, which is an important problem for public health.
2. There are very few DNA language models for eukaryotic viruses, and this work has filled this gap.
3. The training data is carefully curated and documented in detail. It could be a useful resource to the community.
4. The evaluation tasks are well-designed, and data splitting was done thoughtfully.
1. I am not sure if the authors covered all the appropriate baselines. I am not super familiar with the field, but there seem to be many works on using protein language models to predict viral evolution. For example, https://www.science.org/doi/10.1126/science.abd7331, https://www.nature.com/articles/s41392-024-02066-x, https://www.biorxiv.org/content/10.1101/2025.08.04.668423v1. Although this model is trained at the DNA level, the evaluations seem mostly at the amino acid level. The authors sho
The idea of preserving full functional units during pretraining feels biologically meaningful and easy to justify. It makes sense that genomes should be learned as whole systems, not chopped-up pieces. The model is compact and computationally reasonable, which makes the method more accessible to other labs. The overall contribution to treating DNA as structured language is simple but potentially important for future genomic foundation models.
1. The paper claims a big jump over existing models, but I’m not fully convinced it’s a fair comparison. The baselines feel a bit old; why not include newer genomic LMs like HyenaDNA or TITAN? That would make the improvement more believable.
2. Figure 1 is visually appealing and central to understanding the paper, but it could better communicate data imbalance, temporal coverage, and architectural specifics. A supplementary figure showing training sample counts per region/year and a more expli
- The paper presented a sound way of collecting and preprocessing the influenza genomes, taking care of gene-segment alignment and functional-unit preservation, which led to efficient autoregressive model pretraining. Additionally, it has been shown that overlooking alignment and functional-unit preservation can lead to performance degradation.
- The proposed model improved influenza forecasting over evolutionary models and tree-based predictors. The model achieved lower genetic mismatches b
- The presentation of the paper requires improvement. The overall writing style and, in particular, the clarity of explanations are of low quality. While the pretraining procedure of the model is well described, the fine-tuning section is poorly explained. The fine-tuning process for each specific task should be described in greater detail. I recommend illustrating the fine-tuning procedures with clear indications of the prompts and the regions being forecasted. The current Figure 1B does not co
Taxonomy
Topics: Vaccines and immunoinformatics approaches · Influenza Virus Research Studies · Machine Learning in Bioinformatics
