Data Archive Challenge: Transitioning Users to New IDs and Data File Format at the Protein Data Bank
Brian P Hudson, Zukang Feng, Irina Persikova, Yuhe Liang, Ezra Peisach, Jasmine Y Young, Stephen K Burley

TL;DR
The Protein Data Bank is transitioning to new extended IDs and a new data format to handle growing data needs and ensure long-term usability.
Contribution
The paper introduces a five-year transition plan for adopting extended PDB IDs and exclusive use of the PDBx/mmCIF format.
Findings
Extended PDB IDs, formatted as "pdb_" followed by eight alphanumeric characters (e.g., pdb_10021abc), will be introduced because the current four-character IDs are expected to run out by 2028.
New entries will be available only in the PDBx/mmCIF format, which supports complex and large biostructures.
A transition plan includes communication strategies, training resources, and a beta archive for user testing.
Abstract
The Protein Data Bank (PDB), established in 1971 as the first open-access digital resource in biology, has grown from seven X-ray crystallographic structures to over 233,000 3D structures of biological macromolecules supporting research and education across various scientific fields, including fundamental biology, biomedicine, biotechnology, and energy sciences. The Worldwide Protein Data Bank (wwPDB, wwpdb.org) partnership includes five Full Members and one Associate Member, all committed to making structural biology data Findable, Accessible, Interoperable, and Reusable (FAIR). Due to the rapid increase in structure depositions, the wwPDB anticipates exhausting the current four-character PDB IDs by 2028. To prepare for this milestone, the PDB is transitioning to extended IDs, formatted as "pdb_" followed by eight alphanumeric characters (e.g., pdb_10021abc). Once four-character PDB…
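The extended ID format described in the abstract ("pdb_" followed by eight alphanumeric characters) can be checked with a short pattern match. The Python sketch below is illustrative only: the function names are hypothetical, lowercase normalization is an assumption, and the zero-padding rule used to map a legacy four-character ID to an extended one (e.g., 1abc to pdb_00001abc) is an assumption not stated in the text above and should be confirmed against wwPDB documentation.

```python
import re

# Format stated in the abstract: "pdb_" followed by eight alphanumeric characters.
# Assumption: IDs are compared in lowercase, as in the example pdb_10021abc.
EXTENDED_ID = re.compile(r"pdb_[0-9a-z]{8}")

# Assumption: a legacy ID is exactly four alphanumeric characters.
LEGACY_ID = re.compile(r"[0-9a-z]{4}")


def is_extended_pdb_id(text: str) -> bool:
    """Return True if text matches the extended ID pattern."""
    return EXTENDED_ID.fullmatch(text.strip().lower()) is not None


def to_extended_pdb_id(legacy_id: str) -> str:
    """Map a four-character ID to an extended ID by left-padding with zeros.

    The padding rule is an assumption made for this sketch.
    """
    candidate = legacy_id.strip().lower()
    if LEGACY_ID.fullmatch(candidate) is None:
        raise ValueError(f"not a four-character PDB ID: {legacy_id!r}")
    return "pdb_" + candidate.zfill(8)
```

Under these assumptions, `is_extended_pdb_id("pdb_10021abc")` accepts the example ID from the abstract, while a bare four-character ID such as `1abc` is rejected until it has been passed through `to_extended_pdb_id`.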
Taxonomy
Topics: Research Data Management Practices · Scientific Computing and Data Management · Genetics, Bioinformatics, and Biomedical Research
