mdCATH: A Large-Scale MD Dataset for Data-Driven Computational Biophysics
Antonio Mirarchi, Toni Giorgino, Gianni De Fabritiis

TL;DR
mdCATH is a large-scale dataset of all-atom molecular dynamics simulations covering 5,398 protein domains at five temperatures, intended as a data source for studying protein function, folding, and interactions.
Contribution
This work introduces mdCATH, the first large-scale all-atom MD dataset of 5,398 protein domains, each simulated in five replicates at five temperatures from 320 K to 450 K, filling a critical gap in protein dynamics data.
Findings
The dataset includes over 62 ms of accumulated simulation time, with coordinates and forces recorded every 1 ns.
The multi-temperature simulations show potential for analyzing protein unfolding thermodynamics.
Four case studies illustrate applications of the dataset.
Abstract
Recent advancements in protein structure determination are revolutionizing our understanding of proteins. Still, a significant gap remains in the availability of comprehensive datasets that focus on the dynamics of proteins, which are crucial for understanding protein function, folding, and interactions. To address this critical gap, we introduce mdCATH, a dataset generated through an extensive set of all-atom molecular dynamics simulations of a diverse and representative collection of protein domains. This dataset comprises all-atom systems for 5,398 domains, modeled with a state-of-the-art classical force field, and simulated in five replicates each at five temperatures from 320 K to 450 K. The mdCATH dataset records coordinates and forces every 1 ns, for over 62 ms of accumulated simulation time, effectively capturing the dynamics of the various classes of domains and providing a…
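The figures quoted in the abstract imply a few derived quantities that are easy to check. The sketch below is only back-of-the-envelope arithmetic: the constant and function names are illustrative, "over 62 ms" is treated as a lower bound of exactly 62 ms, and the mean trajectory length is an estimate derived here rather than a number stated in the source (individual trajectories may well differ in length).

```python
# Back-of-the-envelope bookkeeping from the numbers quoted in the abstract.
# All names are illustrative; 62 ms is used as a lower bound on total time.
N_DOMAINS = 5_398        # protein domains in the dataset
N_TEMPERATURES = 5       # temperatures between 320 K and 450 K
N_REPLICAS = 5           # replicates per domain and temperature
TOTAL_TIME_MS = 62       # "over 62 ms" of accumulated simulation time
SAVE_INTERVAL_NS = 1     # coordinates and forces recorded every 1 ns

NS_PER_MS = 1_000_000    # 1 ms = 10^6 ns


def dataset_summary():
    """Return trajectory count, mean trajectory length, and frame count."""
    n_trajectories = N_DOMAINS * N_TEMPERATURES * N_REPLICAS
    total_ns = TOTAL_TIME_MS * NS_PER_MS
    mean_ns_per_trajectory = total_ns / n_trajectories
    n_frames = total_ns // SAVE_INTERVAL_NS
    return {
        "n_trajectories": n_trajectories,
        "mean_ns_per_trajectory": mean_ns_per_trajectory,
        "n_frames_lower_bound": n_frames,
    }


if __name__ == "__main__":
    for key, value in dataset_summary().items():
        print(f"{key}: {value:,.1f}" if isinstance(value, float) else f"{key}: {value:,}")
```

Under these assumptions the dataset holds 134,950 trajectories averaging roughly 459 ns each, and at one frame per nanosecond that corresponds to at least 62 million recorded frames.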
