# GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database

**Authors:** Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini

PMC · DOI: 10.1093/database/baaf035 · 2025-05-09

## TL;DR

GenDiS3 is a database that links protein sequences to domain superfamilies of known structure, supporting functional annotation and evolutionary studies.

## Contribution

GenDiS3 updates the GenDiS database using DELTA-BLAST for initial hit detection and HMMSCAN for validation, improving the accuracy of domain identification, and provides validated homologs for 2060 SCOPe superfamilies.

## Key findings

- Over 151 million sequence homologs were identified for 2060 SCOPe superfamilies using DELTA-BLAST, and 116 million of them were validated as true positives using HMMSCAN.
- Case studies on glycolysis enzymes and the LOG gene reveal evolutionary and functional variations through domain architecture analysis.
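
As a quick check on the scale of the validation step, the share of identified homologs that were validated can be worked out from the two reported counts. This is a minimal sketch: the inputs are the rounded figures from the abstract ("over 151 million" and "116 million"), so the result is approximate, and the function name is illustrative rather than part of GenDiS3.

```python
# Rounded counts as reported in the abstract; the exact totals are not given here.
identified_homologs = 151_000_000
validated_homologs = 116_000_000


def validated_fraction(validated: int, identified: int) -> float:
    """Return the share of identified hits that passed validation."""
    if identified <= 0:
        raise ValueError("identified must be positive")
    return validated / identified


fraction = validated_fraction(validated_homologs, identified_homologs)
unvalidated = identified_homologs - validated_homologs

# Roughly three quarters of the initial hits are retained after validation.
print(f"validated: {fraction:.1%}, not validated: {unvalidated:,}")
```

On these rounded inputs, about 77% of the initial hits pass validation and roughly 35 million do not.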

## Abstract

Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology of bridging the gap between sequence information and 3D structural data, and the need for robust databases capable of linking distant homologs to known structures. Studies have indicated that the number of structural folds is limited, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences well before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, built through a one-time computational exercise that searches the vast sequence database for trusted homologs of protein domains of known structure. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified, and 116 million of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in the domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, the LOG gene, in which significant mutations can be traced across the evolutionary lineage.
The updated GenDiS database, GenDiS3, and the associated tools, made available at https://caps.ncbs.res.in/gendis3/, offer a powerful resource for researchers in functional annotation and evolutionary studies.

Database URL: https://caps.ncbs.res.in/gendis3/
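
The two-stage idea described in the abstract, a sensitive search for candidate hits followed by an independent profile-based validation, can be illustrated with a small filter. This is only a sketch under stated assumptions: it does not run DELTA-BLAST or HMMSCAN, the `Hit` type, the `validate_hits` function, and the sequence and superfamily labels are all hypothetical, and the agreement rule used here (keep a hit only if the profile scan assigns the same superfamily) is an assumed criterion, not the thresholds used by the authors.

```python
from typing import Dict, List, NamedTuple, Set


class Hit(NamedTuple):
    """A candidate homolog from the initial search stage (hypothetical record)."""
    sequence_id: str
    superfamily: str  # superfamily of the query domain that retrieved this sequence


def validate_hits(
    search_hits: List[Hit],
    profile_assignments: Dict[str, Set[str]],
) -> List[Hit]:
    """Keep hits whose superfamily is also assigned by the profile scan.

    Assumed rule for illustration: a hit counts as a true positive only when
    the second, independent method reports the same superfamily for it.
    """
    validated = []
    for hit in search_hits:
        confirmed = profile_assignments.get(hit.sequence_id, set())
        if hit.superfamily in confirmed:
            validated.append(hit)
    return validated


# Hypothetical toy data standing in for parsed search and profile-scan output.
search_hits = [
    Hit("seq1", "sf_A"),
    Hit("seq2", "sf_A"),
    Hit("seq3", "sf_B"),
]
profile_assignments = {
    "seq1": {"sf_A"},          # agrees with the search stage
    "seq2": {"sf_C"},          # disagrees, so the hit is dropped
    # seq3 has no profile assignment at all, so it is dropped too
}

validated = validate_hits(search_hits, profile_assignments)
print([hit.sequence_id for hit in validated])
```

Keeping the two stages as separate inputs makes it easy to swap in real parsed output from either tool without changing the filter itself.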

## Linked entities

- **Genes:** LOC4324445 (cytokinin riboside 5'-monophosphate phosphoribohydrolase LOG-like) [NCBI Gene 4324445]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12063530/full.md

---
Source: https://tomesphere.com/paper/PMC12063530