Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT
Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, and Karin Verspoor

TL;DR
This paper develops a BioBERT-based ensemble model to extract protein-protein interactions annotated with PTM types from the biomedical literature at scale, yielding a set of high-confidence predictions and highlighting challenges in generalizability and confidence calibration.
Contribution
It introduces a novel ensemble approach with confidence calibration for large-scale PTM extraction from literature, improving precision in protein interaction data mining.
Findings
Retained 19% of predictions with 100% precision.
Extracted 1.6 million PTM-PPI triplets from PubMed.
High-confidence predictions supported by multiple papers reached 58.8% precision.
Abstract
Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high…
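The idea of combining ensemble average confidence with confidence variation can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ensemble_filter`, the thresholds `min_confidence` and `max_std`, and the use of the population standard deviation as the measure of variation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def ensemble_filter(member_probs, min_confidence=0.9, max_std=0.05):
    """Decide whether to keep one example's prediction.

    member_probs: one list of class probabilities per ensemble member,
    all for the same input example.
    Threshold values are hypothetical, not taken from the paper.
    """
    n_classes = len(member_probs[0])
    # Average each class probability across ensemble members.
    avg = [mean([m[c] for m in member_probs]) for c in range(n_classes)]
    # Predicted class is the one with the highest average probability.
    predicted = max(range(n_classes), key=lambda c: avg[c])
    confidence = avg[predicted]
    # Variation: spread of the members' probabilities for that class.
    spread = pstdev([m[predicted] for m in member_probs])
    # Keep only predictions that are both confident and consistent.
    keep = confidence >= min_confidence and spread <= max_std
    return predicted, confidence, spread, keep

# Members agree closely, so the prediction is kept.
agree = [[0.02, 0.98], [0.03, 0.97], [0.01, 0.99]]
# Average confidence is still high, but one member dissents,
# so the spread exceeds the threshold and the prediction is dropped.
disagree = [[0.0, 1.0], [0.0, 1.0], [0.25, 0.75]]

print(ensemble_filter(agree))
print(ensemble_filter(disagree))
```

Raising `min_confidence` or lowering `max_std` trades recall for precision, which is the kind of tuning the abstract describes when it retains only a fraction of the predictions.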
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Bioinformatics and Genomic Networks · Microbial Metabolism and Applications
