# Smart distributed data factory volunteer computing platform for active learning-driven molecular data acquisition

**Authors:** Tsolak Ghukasyan, Vahagn Altunyan, Aram Bughdaryan, Tigran Aghajanyan, Khachik Smbatyan, Garegin A. Papoian, Garik Petrosyan

PMC · DOI: 10.1038/s41598-025-90981-6 · Scientific Reports · 2025-02-28

## TL;DR

The SDDF platform uses global volunteer computing and machine learning to efficiently generate molecular data for drug discovery.

## Contribution

A novel volunteer computing platform for DFT calculations and an active learning framework for molecular conformation datasets.

## Key findings

- SDDF generates a large public dataset of ENAMINE molecules with calculated energies.
- The platform reduces the need for extensive quantum chemistry calculations through active learning.
- The dataset supports training and benchmarking energy prediction models.

## Abstract

This paper presents the smart distributed data factory (SDDF), an AI-driven distributed computing platform designed to address challenges in drug discovery by creating comprehensive datasets of molecular conformations and their properties. SDDF uses volunteer computing, leveraging the processing power of personal computers worldwide to accelerate quantum chemistry (DFT) calculations. To tackle the vast chemical space and limited high-quality data, SDDF employs an ensemble of machine learning (ML) models to predict molecular properties and select the most challenging data points for further DFT calculations. The platform also generates new molecular conformations using molecular dynamics with forces derived from these models. SDDF makes several contributions: a volunteer computing platform for DFT calculations; an active learning framework for constructing a dataset of molecular conformations; a large public dataset of diverse ENAMINE molecules with calculated energies; and an ensemble of ML models for accurate energy prediction. The energy dataset was generated to validate the SDDF approach of reducing the need for extensive calculations. With its strict scaffold split, the dataset can be used for training and benchmarking energy models. By combining active learning, distributed computing, and quantum chemistry, SDDF offers a scalable, cost-effective solution for developing accurate molecular models and ultimately accelerating drug discovery.

The online version contains supplementary material available at 10.1038/s41598-025-90981-6.
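
The selection step described in the abstract, where an ensemble of ML models picks the most challenging data points for further DFT calculations, can be illustrated with a minimal sketch. It assumes that "challenging" is measured by disagreement between ensemble members (the standard deviation of their predicted energies), a common query-by-committee heuristic; the abstract does not state the criterion SDDF actually uses. The function name `select_for_dft`, the `budget` parameter, and the toy energies are hypothetical and do not come from the paper.

```python
from statistics import pstdev


def select_for_dft(ensemble_predictions, budget):
    """Return indices of the `budget` conformations with the largest ensemble disagreement.

    ensemble_predictions[m][i] is model m's predicted energy for conformation i
    (a hypothetical layout chosen for this sketch).
    """
    n_conformations = len(ensemble_predictions[0])
    # Disagreement = population standard deviation of the models' predictions
    # for each conformation (an assumed uncertainty proxy).
    disagreement = [
        pstdev([model[i] for model in ensemble_predictions])
        for i in range(n_conformations)
    ]
    # Rank conformations from most to least uncertain and keep the top `budget`.
    ranked = sorted(
        range(n_conformations), key=lambda i: disagreement[i], reverse=True
    )
    return ranked[:budget]


# Toy example: 3 models, 4 conformations (made-up energies, arbitrary units).
predictions = [
    [-1.00, -2.00, -3.00, -4.00],
    [-1.01, -2.50, -3.00, -4.10],
    [-0.99, -1.50, -3.00, -3.90],
]
selected = select_for_dft(predictions, budget=2)
print(selected)
```

In an active learning loop of this kind, the selected conformations would be sent for DFT calculation, the resulting labels added to the training set, and the ensemble retrained before the next round.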

## Full-text entities

- **Chemicals:** Bromine (MESH:D001966), ENAMINE (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11868574/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11868574/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC11868574/full.md

---
Source: https://tomesphere.com/paper/PMC11868574