Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing
Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Nanqing Dong, Zhiqiang Gao, Siqi Sun

TL;DR
This paper introduces a curriculum learning strategy for non-autoregressive peptide sequencing models, significantly reducing training failures and improving accuracy across multiple species by progressively teaching the model from simple to complex sequences.
Contribution
The paper proposes a curriculum learning approach for NAT-based peptide sequencing and pairs it with a self-refining inference module, improving training stability and prediction accuracy.
Findings
Reduces NAT training failures by over 90%.
Outperforms previous methods on nine benchmark species.
Improves sequence accuracy through iterative inference.
Abstract
Peptide sequencing, the process of identifying amino acid sequences from mass spectrometry data, is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self-attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which is difficult to optimize because of its complexity and increases the risk of training failures. To address these issues, we propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts the learning difficulty of protein sequences based on the model's estimated…
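To make the role of CTC in a non-autoregressive decoder more concrete, here is a minimal sketch of the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) applied to a greedy per-position prediction. It is not taken from the paper: the blank symbol `_`, the toy alphabet, and the function names `ctc_collapse` and `greedy_ctc_decode` are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, chosen for this illustration


def ctc_collapse(path, blank=BLANK):
    """Apply the CTC collapsing rule: merge consecutive repeats, then drop blanks."""
    merged = [token for token, _ in groupby(path)]
    return [token for token in merged if token != blank]


def greedy_ctc_decode(scores, alphabet, blank=BLANK):
    """Pick the highest-scoring symbol at every position independently
    (all positions at once, as a non-autoregressive model would),
    then collapse the resulting path."""
    path = []
    for row in scores:
        best_index = max(range(len(row)), key=row.__getitem__)
        path.append(alphabet[best_index])
    return ctc_collapse(path, blank)


# Toy example: 4 output positions over a hypothetical 3-symbol alphabet.
alphabet = [BLANK, "A", "G"]
scores = [
    [0.1, 0.8, 0.1],    # argmax -> "A"
    [0.2, 0.7, 0.1],    # argmax -> "A" (repeat, merged away)
    [0.9, 0.05, 0.05],  # argmax -> blank
    [0.1, 0.2, 0.7],    # argmax -> "G"
]
decoded = greedy_ctc_decode(scores, alphabet)
```

Because many different position-level paths collapse to the same output sequence, the CTC loss has to sum over all of them during training, which is one common explanation for why it is harder to optimize than a per-token cross-entropy loss.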
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research · Genomics and Phylogenetic Studies
