Mitigating the Antigenic Data Bottleneck: Semi-supervised Learning with Protein Language Models for Influenza A Surveillance
Yanhua Xu

TL;DR
This study demonstrates that combining Protein Language Models with Semi-Supervised Learning significantly improves influenza antigenicity prediction accuracy when labeled data are scarce, facilitating better surveillance and vaccine development.
Contribution
It introduces a novel approach integrating PLMs with SSL to enhance influenza antigenicity prediction under limited labeled data conditions.
Findings
SSL improves performance in low-label regimes
ESM-2 embeddings are highly robust for antigenicity prediction
SSL mitigates performance decline in the hypervariable H3N2 subtype
Abstract
Influenza A viruses (IAVs) evolve antigenically at a pace that requires frequent vaccine updates, yet the haemagglutination inhibition (HI) assays used to quantify antigenicity are labor-intensive and unscalable. As a result, genomic data vastly outpace available phenotypic labels, limiting the effectiveness of traditional supervised models. We hypothesize that combining pre-trained Protein Language Models (PLMs) with Semi-Supervised Learning (SSL) can retain high predictive accuracy even when labeled data are scarce. We evaluated two SSL strategies, Self-training and Label Spreading, against fully supervised baselines using four PLM-derived embeddings (ESM-2, ProtVec, ProtT5, ProtBert) applied to haemagglutinin (HA) sequences. A nested cross-validation framework simulated low-label regimes (25%, 50%, 75%, and 100% label availability) across four IAV subtypes (H1N1, H3N2, H5N1, H9N2).…
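The comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's pipeline. It assumes scikit-learn's `SelfTrainingClassifier` and `LabelSpreading`, uses synthetic features as a stand-in for PLM embeddings of HA sequences, picks logistic regression as a base classifier purely for illustration, and replaces nested cross-validation with a single train/test split. The helper names `mask_labels` and `run_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier


def mask_labels(y, label_fraction, rng):
    """Keep labels for a random `label_fraction` of samples; mark the rest as -1.

    scikit-learn's semi-supervised estimators treat -1 as "unlabeled".
    """
    y_masked = np.full_like(y, -1)
    n_keep = max(2, int(round(label_fraction * len(y))))
    keep = rng.choice(len(y), size=n_keep, replace=False)
    y_masked[keep] = y[keep]
    return y_masked


def run_demo(label_fraction=0.25, seed=0):
    """Compare a supervised baseline with two SSL strategies at one label fraction."""
    rng = np.random.default_rng(seed)

    # Assumption: synthetic vectors stand in for fixed-length PLM embeddings,
    # and the binary target stands in for an antigenic-similarity label.
    X, y = make_classification(
        n_samples=400, n_features=32, n_informative=8, random_state=seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Simulate a low-label regime on the training split only.
    y_partial = mask_labels(y_train, label_fraction, rng)
    labeled = y_partial != -1

    # Supervised baseline: sees only the labeled subset.
    supervised = LogisticRegression(max_iter=1000)
    supervised.fit(X_train[labeled], y_train[labeled])

    # Self-training: iteratively pseudo-labels confident unlabeled samples.
    self_training = SelfTrainingClassifier(
        LogisticRegression(max_iter=1000), threshold=0.9
    )
    self_training.fit(X_train, y_partial)

    # Label Spreading: propagates labels over a k-nearest-neighbor graph.
    label_spreading = LabelSpreading(kernel="knn", n_neighbors=7)
    label_spreading.fit(X_train, y_partial)

    return {
        "supervised": supervised.score(X_test, y_test),
        "self_training": self_training.score(X_test, y_test),
        "label_spreading": label_spreading.score(X_test, y_test),
    }


if __name__ == "__main__":
    for fraction in (0.25, 0.50, 0.75, 1.00):
        print(fraction, run_demo(label_fraction=fraction))
```

On synthetic data the relative ranking of the three models carries no meaning for influenza. The sketch only shows how labels can be masked to simulate the 25% to 100% availability regimes and how both SSL strategies consume the unlabeled rows that a supervised baseline must discard.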
Taxonomy
Topics: Influenza Virus Research Studies · vaccines and immunoinformatics approaches · Machine Learning in Bioinformatics
