CATHe2: Enhanced CATH superfamily detection using ProstT5 and structural alphabets
Orfeú Mouret, Jad Abbass

TL;DR
CATHe2 is an improved automated system for classifying protein domain superfamilies using updated language models and structural data.
Contribution
CATHe2 introduces ProstT5 and structural alphabet embeddings to enhance CATH superfamily classification accuracy and F1 score.
Findings
CATHe2 achieves 92.2% accuracy and 82.3% F1 score, a significant improvement over CATHe.
Using ProstT5 and 3Di embeddings improves on CATHe by 9.9% in F1 score and 6.6% in accuracy.
A simplified version using only AA sequences still improves F1 score by 6.7% and accuracy by 6.6%.
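The findings report accuracy and F1 score side by side, and the F1 score is roughly ten points lower than the accuracy. A minimal sketch of how the two metrics can diverge on imbalanced classes is given here. It assumes macro-averaged F1, which the summary does not specify, and the labels `SF_A` and `SF_B` are made-up placeholders rather than real CATH superfamily identifiers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list.

    Assumption: macro averaging; the source does not state which
    F1 variant was reported.
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy data (hypothetical): a large class and a small one.
y_true = ["SF_A"] * 8 + ["SF_B"] * 2
y_pred = ["SF_A"] * 8 + ["SF_A", "SF_B"]  # one SF_B domain is misclassified

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

In this toy case a single error on the small class lowers the macro F1 much more than the accuracy, because each class contributes equally to the macro average regardless of its size.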
Abstract
The CATH database is a free, publicly available online resource that provides annotations about the evolutionary and structural relationships of protein domains. Because the influx of protein structures, driven mainly by the recent breakthrough of AlphaFold, makes manual curation infeasible, the CATH team recently developed an automatic CATH superfamily (SF) classifier called CATHe, which uses a feed-forward neural network (FNN) classifier with protein language model (pLM) embeddings as input. Using the same dataset of remote homologues (with a 20% sequence identity threshold), this paper presents CATHe2, which improves on CATHe by replacing the older pLM ProtT5 with a more recent one, ProstT5, and by incorporating domain 3D information into the classifier through a structural alphabet representation, specifically 3Di sequence embeddings. Finally,…
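The classifier design described in the abstract, an FNN fed with an amino-acid embedding plus a 3Di embedding, might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: fusing the two embeddings by concatenation, the single hidden layer, the toy sizes `EMB_DIM`, `HIDDEN` and `N_SF`, and the random vectors standing in for real ProstT5 outputs are all hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, illustrative only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(aa_emb, di_emb, params):
    """Forward pass: concatenate both embeddings, then hidden layer and softmax."""
    x = aa_emb + di_emb  # list concatenation; the fusion method is an assumption
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
EMB_DIM, HIDDEN, N_SF = 8, 16, 5  # toy sizes, far smaller than a real model
params = [init_layer(rng, 2 * EMB_DIM, HIDDEN), init_layer(rng, HIDDEN, N_SF)]

# Random placeholders for per-domain embeddings of the AA and 3Di sequences.
aa_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
di_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]

probs = classify(aa_emb, di_emb, params)
predicted_sf = max(range(N_SF), key=probs.__getitem__)
print(predicted_sf, [round(p, 3) for p in probs])
```

Dropping `di_emb` from the input and halving the first layer's input size would give a sequence-only variant in the spirit of the simplified model mentioned in the findings.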
Taxonomy
Topics: Genomics and Rare Diseases · Machine Learning in Bioinformatics · Protein Structure and Dynamics
