GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database
Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini

TL;DR
GenDiS3 is a database that links protein sequences to known structures, helping researchers understand evolutionary and functional diversity of proteins.
Contribution
GenDiS3 introduces improved bioinformatics tools for accurate domain identification and provides validated homologs for 2060 superfamilies.
Findings
Over 151 million sequence homologs were identified and 116 million validated as true positives using DELTA-BLAST and HMMSCAN.
Case studies on glycolysis enzymes and the LOG gene reveal evolutionary and functional variations through domain architecture analysis.
Abstract
Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Protein Structure and Dynamics
