# CysDuF database: annotation and characterization of cysteine residues in domain of unknown function proteins based on cysteine post-translational modifications, their protein microenvironments, biochemical pathways, taxonomy, and diseases

**Authors:** Devarakonda Himaja, Debashree Bandyopadhyay

PMC · DOI: 10.1093/database/baag002 · 2026-01-23

## TL;DR

The CysDuF database annotates cysteine residues in domain of unknown function (DUF) proteins with their predicted post-translational modifications, protein microenvironments, biochemical pathways, taxonomy, and disease associations.

## Contribution

The first comprehensive annotation of cysteine post-translational modifications in DUF proteins across multiple pathways and species, including SARS-CoV-2.

## Key findings

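- The microenvironment around cysteine residues in DUF proteins is mainly buried and hydrophobic; in SARS-CoV-2, however, a significant number of functional cysteine residues are surface-exposed with a hydrophilic microenvironment.
- Cysteine PTMs were predicted with the DeepCys server; the prediction accuracy, validated against known experimental cysteine PTMs (n = 18 626), was 0.79.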
- The database includes annotations for seven biochemical pathways and is accessible via DUF, PFAM, or PDB IDs.
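Because the database accepts three kinds of identifier, a client-side script may want to work out which kind a query string is before submitting it. The sketch below is one way to do that, and everything in it is an assumption rather than part of CysDuF: the function name `classify_query_id` is hypothetical, and the patterns follow general public conventions (Pfam accessions as `PF` plus five digits, PDB IDs as a digit plus three alphanumeric characters, DUF names as `DUF` plus a number). The formats the CysDuF site itself accepts may differ.

```python
import re

# Assumed identifier conventions (not taken from the CysDuF site):
#   DUF  : "DUF" followed by one or more digits, e.g. DUF1220
#   PFAM : "PF" followed by exactly five digits, e.g. PF06758
#   PDB  : one digit followed by three alphanumeric characters, e.g. 7JTL
ID_PATTERNS = {
    "DUF": re.compile(r"DUF\d+", re.IGNORECASE),
    "PFAM": re.compile(r"PF\d{5}", re.IGNORECASE),
    "PDB": re.compile(r"[0-9][A-Za-z0-9]{3}"),
}


def classify_query_id(query: str) -> str:
    """Return "DUF", "PFAM", "PDB", or "UNKNOWN" for a query string."""
    text = query.strip()
    for kind, pattern in ID_PATTERNS.items():
        # fullmatch requires the whole string to fit the pattern.
        if pattern.fullmatch(text):
            return kind
    return "UNKNOWN"


if __name__ == "__main__":
    for example in ["DUF1220", "PF06758", "7JTL", "not-an-id"]:
        print(example, classify_query_id(example))
```

The three patterns cannot match the same string, so the order in which they are checked does not affect the result.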

## Abstract

Experimental characterization and annotation of amino acids belonging to domains of unknown function (DUF) proteins are expensive and time-consuming, and could be complemented by computational methods. Cysteine, being the second most reactive amino acid at the catalytic sites of enzymes, was selected for functional annotation and characterization in DUF proteins. Earlier, we reported functional annotation of cysteine in DUF proteins belonging to the COX-II family. To the best of our knowledge, however, a holistic characterization of cysteine functions in DUF proteins has not been reported. Here, we annotated and characterized cysteine residues based on post-translational modifications (PTMs), biochemical pathways, diseases, taxonomy, and protein microenvironment. The information on uncharacterized DUF proteins was initially obtained from the literature, and the sequence, structure, pathways, taxonomy, and disease information were retrieved from the SCOPe database using DUF IDs. Protein microenvironments (MENV) around cysteine residues from DUF proteins were computed using protein structures (n = 70 342). The cysteine PTMs were predicted using the in-house cysteine-function prediction server, DeepCys (https://deepcys.bits-hyderabad.ac.in). The accuracy of the prediction, validated against known experimental cysteine PTMs (n = 18 626), was 0.79. The information was consolidated in the database (https://cysduf.bits-hyderabad.ac.in/), retrievable in downloadable formats (CSV, JSON, or TXT) using any of the following inputs: DUF ID, PFAM ID, or PDB ID. For the first time, we annotated cysteine PTMs in DUF proteins belonging to seven different biochemical pathways and various species across the taxonomy, notably for the SARS-CoV-2 virus. The MENV around cysteine residues in DUF proteins was mainly buried and hydrophobic. However, in the SARS-CoV-2 virus, a significant number of functional cysteine residues were exposed on the surface with a hydrophilic microenvironment.
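The abstract reports an accuracy of 0.79 but does not spell out the metric. A minimal sketch, assuming accuracy here means the fraction of cysteine sites whose predicted PTM label equals the experimentally known label, is given below. The function name `prediction_accuracy`, the label strings, and the five-site toy data are all illustrative and are not taken from the paper or from DeepCys.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], experimental: Sequence[str]) -> float:
    """Fraction of sites where the predicted label equals the experimental one."""
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one site is required")
    # Count position-wise agreements between the two label lists.
    matches = sum(p == e for p, e in zip(predicted, experimental))
    return matches / len(predicted)


if __name__ == "__main__":
    # Toy labels for five hypothetical cysteine sites (not real data).
    predicted = ["disulfide", "metal-binding", "thioether", "disulfide", "sulfenylation"]
    experimental = ["disulfide", "metal-binding", "thioether", "metal-binding", "sulfenylation"]
    print(prediction_accuracy(predicted, experimental))
```

In this toy case four of the five labels agree. With several PTM classes of unequal size, overall accuracy can hide weak performance on rare classes, so per-class figures are a useful complement when they are available.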

## Full-text entities

- **Genes:** PPP1CA (protein phosphatase 1 catalytic subunit alpha) [NCBI Gene 5499] {aka PP-1A, PP1A, PP1alpha, PPP1A}
- **Diseases:** bacterial infections (MESH:D001424), infection (MESH:D007239), worm (MESH:D017189), inherited diseases (MESH:D030342), urinary tract infections (MESH:D014552), neuronal diseases (MESH:D016472), biliary tract disease (MESH:D001660), parasitic worm (MESH:D010272), septic arthritis (MESH:D001170), sexually transmitted diseases (MESH:D012749), Coronavirus (MESH:D018352), viral diseases (MESH:D014777), protozoan diseases (MESH:D011528), DUF (MESH:D009382), pneumonia (MESH:D011014), lung diseases (MESH:D008171), food-borne illnesses (MESH:D005517), fungal diseases (MESH:D009181)
- **Chemicals:** lipid (MESH:D008055), glutathione (MESH:D005978), fatty acid (MESH:D005227), sulfinic acid (MESH:D013441), Fe (MESH:D007501), S (MESH:D013455), sulfenic acid (MESH:D013434), metal (MESH:D008670), reactive nitrogen species (MESH:D026361), Dipro (-), Cys (MESH:D003545), thiol (MESH:D013438), heavy metal (MESH:D019216), acid (MESH:D000143), ROS (MESH:D017382), disulfide (MESH:D004220), phosphate (MESH:D010710), sulfonic acid (MESH:D013451), carbon (MESH:D002244), thioether (MESH:D013440), Pentose phosphate (MESH:D010428), water (MESH:D014867)
- **Species:** Entamoeba histolytica (species) [taxon 5759], Bacteria Latreille et al. 1825 (Bacteria stick insect, genus) [taxon 629395], Middle East respiratory syndrome-related coronavirus (no rank) [taxon 1335626], Shewanella frigidimarina (species) [taxon 56812], Clostridium botulinum (species) [taxon 1491], Plasmodium falciparum (malaria parasite P. falciparum, species) [taxon 5833], Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Gammacoronavirus (genus) [taxon 694013], Agrobacterium tumefaciens (species) [taxon 358], Fasciola hepatica (liver fluke, species) [taxon 6192], Trichomonas vaginalis (species) [taxon 5722], Homo sapiens (human, species) [taxon 9606], Mycobacterium tuberculosis (species) [taxon 1773]

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12828279/full.md

---
Source: https://tomesphere.com/paper/PMC12828279