Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation

Brian Shing-Hei Wong; Joshua Mincheol Kim; Sin-Hang Fung; Qing Xiong; Kelvin Fu-Kiu Ao; Junkang Wei; Ran Wang; Dan Michelle Wang; Jingying Zhou; Bo Feng; Alfred Sze-Lok Cheng; Kevin Y. Yip; Stephen Kwok-Wing Tsui; Qin Cao

arXiv:2508.10541 · cs.LG · August 18, 2025

TL;DR

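This paper introduces Applm, an allergen prediction framework built on the 100-billion-parameter xTrimoPGLM protein language model. Applm outperforms seven state-of-the-art methods on tasks that resemble difficult real-world scenarios: identifying novel allergens, separating allergens from highly similar non-allergen homologs, and assessing the effects of mutations.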

Contribution

The paper presents Applm (Allergen Prediction with Protein Language Models), an allergen prediction method that leverages the 100-billion-parameter xTrimoPGLM protein language model, and evaluates it on challenging tasks designed to test generalization.

Findings

01. Applm outperforms seven state-of-the-art methods across diverse tasks.

02. xTrimoPGLM captures general protein features crucial for allergen prediction.

03. The framework effectively identifies novel allergens and assesses mutation impacts.

Abstract

Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non-allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein…
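The "novel allergen" setting in the abstract, where test proteins lack similar examples in the training set, can be illustrated with a small filtering sketch. This is not the paper's pipeline: the k-mer Jaccard score is only a crude stand-in for a real sequence-identity measure, and the function names, the threshold of 0.3, and the toy sequences are all hypothetical choices made for illustration.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def novel_test_subset(
    train: List[str],
    test: List[str],
    threshold: float = 0.3,  # hypothetical cutoff, not taken from the paper
    k: int = 3,
) -> List[str]:
    """Keep only test sequences whose best match in the training set
    scores below the threshold, i.e. sequences with no close neighbor."""
    train_sets = [kmers(s, k) for s in train]
    novel = []
    for seq in test:
        query = kmers(seq, k)
        # Highest similarity to any training sequence; 0.0 if train is empty.
        max_sim = max((jaccard(query, t) for t in train_sets), default=0.0)
        if max_sim < threshold:
            novel.append(seq)
    return novel


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    train_seqs = ["MKTAYIAKQR", "GSHMLEDPVA"]
    test_seqs = ["MKTAYIAKQR", "WWWWCCCCHH"]
    print(novel_test_subset(train_seqs, test_seqs))
```

In practice a dedicated alignment or clustering tool would replace the k-mer score, but the structure stays the same: measure each test protein's closest training neighbor and evaluate only on those below a chosen similarity cutoff.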
