Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences
Fabio Falcioni, Elena Orlova, Timothy Heightman, Philip Mantrov, Aleksei Ustimenko

TL;DR
This paper benchmarks Simulacra's quantum-accurate synthetic data generation pipeline against a state-of-the-art Microsoft pipeline. It reports 15-50x lower data generation costs at parity in energy accuracy, making large-scale ab initio chemical datasets affordable and speeding up AI-driven discovery.
Contribution
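It introduces a novel, proprietary sampling scheme, RELAX (Replica Exchange with Langevin Adaptive eXploration), and shows that Simulacra's Large Wavefunction Models (LWM) pipeline reduces data generation costs while maintaining energy accuracy, both against a state-of-the-art Microsoft pipeline and against traditional CCSD methods.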
Findings
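Data generation costs reduced by 15-50x in the benchmark against the Microsoft pipeline
Parity in energy accuracy is maintained
Data generation costs reduced by 2-3x compared to traditional CCSD methods at the scale of amino acids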
Abstract
In this work, we benchmark Simulacra's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing energy quality, autocorrelation times, and effective sample size, we show that Simulacra's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x while maintaining parity in energy accuracy, and by 2-3x compared to traditional CCSD methods at the scale of amino acids. This enables the creation of affordable, large-scale ab initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).
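The abstract names autocorrelation time and effective sample size as sampler-quality metrics. The sketch below illustrates how these two quantities are commonly estimated from a Monte Carlo trace. It is not the authors' implementation: the function names, the first-non-positive-lag truncation rule, and the synthetic AR(1) chain standing in for a VMC local-energy trace are all assumptions made for illustration.

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=None):
    """Estimate tau_int = 1 + 2 * sum_k rho_k, truncating the sum at the
    first non-positive autocorrelation (a simple, assumed cutoff rule)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x = x - x.mean()
    var = np.dot(x, x) / n
    if var == 0.0:
        return 1.0  # constant trace: treat samples as uncorrelated
    if max_lag is None:
        max_lag = n // 2
    tau = 1.0
    for lag in range(1, max_lag):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)  # biased estimator of rho_lag
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

def effective_sample_size(x):
    """ESS = N / tau_int: the number of independent samples the chain is worth."""
    return len(x) / integrated_autocorr_time(x)

# Hypothetical stand-in for a local-energy trace: an AR(1) chain with
# coefficient phi, whose exact tau_int is (1 + phi) / (1 - phi).
rng = np.random.default_rng(0)
phi, n_steps = 0.9, 50_000
noise = rng.standard_normal(n_steps)
chain = np.empty(n_steps)
chain[0] = noise[0]
for t in range(1, n_steps):
    chain[t] = phi * chain[t - 1] + noise[t]

tau = integrated_autocorr_time(chain)
ess = effective_sample_size(chain)
print(f"tau_int ~ {tau:.1f}, ESS ~ {ess:.0f} of {n_steps} samples")
```

Under this picture, a sampler that halves the autocorrelation time at equal cost per step doubles the effective sample size, so it reaches a given statistical error on the energy with half the compute.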
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Protein Structure and Dynamics
