Reformulation of the protein databank for real-time search of geometrical attributes of protein structures
Musa Azeem, Christopher Lee, Aaron Hein, Christopher Ott, Homayoun Valafar

TL;DR
This paper introduces PDBMine, a platform for analyzing protein structures by examining dihedral angles and sequence patterns in the Protein Data Bank.
Contribution
The novel contribution is PDBMine, a scalable platform for real-time mining of protein structure geometrical attributes and sequence dependencies.
Findings
Longer k-mers show significant deviations from statistical independence, indicating context-dependent amino acid co-occurrence.
Increasing sequence context reduces dihedral angle variability, aligning better with observed backbone geometries.
A clustering method identifies dominant structural motifs from full-sequence dihedral conformations.
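The first finding, that longer k-mers depart from statistical independence, can be illustrated with a minimal sketch. This is not PDBMine's API, and the statistic the paper actually uses is not stated in this summary. The log-ratio measure, the function names (`kmer_counts`, `independence_log_ratios`), and the toy sequences are all assumptions made for illustration. The sketch compares each k-mer's observed frequency with the frequency expected if residues occurred independently, which is the product of the single-residue frequencies.

```python
from collections import Counter
from math import log2


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def independence_log_ratios(sequences, k):
    """Return log2(observed / expected) for each observed k-mer.

    'expected' assumes residues are independent, so it is the product
    of the single-residue frequencies (an illustrative null model).
    """
    residue_counts = kmer_counts(sequences, 1)
    total_residues = sum(residue_counts.values())
    residue_freq = {aa: n / total_residues for aa, n in residue_counts.items()}

    observed = kmer_counts(sequences, k)
    total_kmers = sum(observed.values())

    ratios = {}
    for kmer, n in observed.items():
        expected = 1.0
        for aa in kmer:
            expected *= residue_freq[aa]
        ratios[kmer] = log2((n / total_kmers) / expected)
    return ratios


# Hypothetical toy input, not data from the PDB.
toy_sequences = ["ACACACAC", "GGGG"]
ratios = independence_log_ratios(toy_sequences, 2)
for kmer, value in sorted(ratios.items()):
    print(kmer, round(value, 3))
```

A log-ratio near zero means the k-mer occurs about as often as independence predicts. Large positive or negative values indicate context-dependent co-occurrence. On real data, a significance test over the counts would be needed before drawing conclusions.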
Abstract
In this study, we introduce the design and implementation of PDBMine, a large-scale, queryable platform for mining sequence-structure statistics from the Protein Data Bank (PDB). PDBMine enables rapid analysis of local conformational trends across proteins by extracting dihedral angles and sequence patterns at scale. In addition to the design and implementation of PDBMine, we also present results validating its ability to return structurally meaningful information. We first assess the accuracy of its dihedral angle distributions by comparing them to established Ramachandran space and verifying expected behaviors of residues such as glycine and proline. We then use PDBMine to analyze the statistical properties of amino acid subsequences of length k = 1 to 5. Our findings reveal that longer k-mers exhibit significant departures from statistical independence, suggesting…
Taxonomy
Topics: Protein Structure and Dynamics · Machine Learning in Bioinformatics · Enzyme Structure and Function
