Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration
Malikussaid, Septian Caesar Floresko, Sutiyo

TL;DR
This paper demonstrates a byte-offset indexing approach that cuts large-scale chemical database integration from a projected 100-day runtime to 3.2 hours, enabling efficient creation of validated datasets for cheminformatics and machine learning.
Contribution
It introduces a scalable byte-offset indexing architecture that overcomes brute-force limitations, ensuring data integrity and efficiency at hundreds of millions of entries.
Findings
Achieved 740-fold performance improvement over brute-force methods
Validated 176 million chemical entries with collision-free identifiers
Provided benchmarks and trade-offs for large-scale data integration
Abstract
The integration of large-scale chemical databases is a critical bottleneck in modern cheminformatics research, particularly for machine learning applications that require high-quality datasets validated against multiple sources. This paper presents a case study of integrating three major public chemical repositories, PubChem (176 million compounds), ChEMBL, and eMolecules, to construct a curated dataset for molecular property prediction. We investigate whether byte-offset indexing can practically overcome the scalability limits of brute-force search while preserving data integrity at the hundred-million scale. Our results document the progression from an intractable brute-force search algorithm with a projected 100-day runtime to a byte-offset indexing architecture that completes in 3.2 hours, a 740-fold performance improvement achieved through a reduction in algorithmic complexity.…
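The general idea behind byte-offset indexing can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tab-separated file layout, the toy records, and the helper names `build_offset_index`, `fetch_record`, and `write_toy_file` are all assumptions made for the example. A brute-force search rescans the file for every query, whereas the index is built in one sequential pass and each later lookup is a dictionary access followed by a single `seek()`.

```python
import os
import tempfile


def build_offset_index(path):
    """One sequential pass: map each record key to the byte offset of its line."""
    index = {}
    with open(path, "rb") as handle:  # binary mode so offsets are exact byte counts
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            # Assumed layout: key, a tab, then the rest of the record.
            key = line.split(b"\t", 1)[0].decode("utf-8")
            index[key] = offset
    return index


def fetch_record(handle, index, key):
    """Jump straight to a record with seek() instead of rescanning the file."""
    offset = index.get(key)
    if offset is None:
        return None
    handle.seek(offset)
    return handle.readline().decode("utf-8").rstrip("\n").split("\t")


def write_toy_file(rows):
    """Write illustrative tab-separated rows to a temporary file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "wb") as out:
        for key, smiles in rows:
            out.write(f"{key}\t{smiles}\n".encode("utf-8"))
    return path


if __name__ == "__main__":
    # Hypothetical identifiers paired with simple SMILES strings.
    path = write_toy_file([("A1", "C"), ("A2", "CCO"), ("A3", "c1ccccc1")])
    try:
        index = build_offset_index(path)
        with open(path, "rb") as handle:
            for wanted in ["A3", "A1", "missing"]:
                print(wanted, fetch_record(handle, index, wanted))
    finally:
        os.remove(path)
```

The index stores only a key and an integer per record, so it stays far smaller than the data file itself. Whether it fits in memory at the scale described in the abstract depends on key length and the runtime's per-entry overhead, which this sketch does not address.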
Taxonomy
Topics: Computational Drug Discovery Methods · Scientific Computing and Data Management · Machine Learning in Materials Science
