ZSMILES: an approach for efficient SMILES storage for random access in Virtual Screening
Gianmarco Accordi, Davide Gadioli, Giorgio Seguini, Andrea R. Beccari, and Gianluca Palermo

TL;DR
This paper introduces ZSMILES, a dictionary-based compression method for SMILES datasets in virtual screening, achieving better compression and faster access, thus reducing storage needs in high-throughput drug discovery applications.
Contribution
The paper presents a novel domain knowledge-based compression approach for SMILES data, enabling efficient storage and random access in large virtual screening datasets.
Findings
ZSMILES achieves up to 1.13x better compression than existing methods.
The CUDA implementation achieves a 7x speedup in data processing.
ZSMILES reduces the cold storage footprint of virtual screening datasets on HPC systems.
Abstract
Virtual screening is a technique used in drug discovery to select the most promising molecules to test in a lab. It requires a large set of molecules as input, and storing these molecules can become an issue. In fact, extreme-scale high-throughput virtual screening applications require a large dataset of input molecules and produce an even larger dataset as output. These molecule databases occupy tens of TB of storage space, and domain experts frequently sample a small portion of this data. In this context, SMILES is a popular data format for storing large sets of molecules since it requires significantly less space to represent molecules than other formats (e.g., MOL2, SDF). This paper proposes an efficient dictionary-based approach to compress SMILES-based datasets. This approach takes advantage of domain knowledge to provide a readable output with separable…
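To make the idea of dictionary-based compression with separable records concrete, here is a minimal Python sketch. It is not the ZSMILES implementation: the dictionary `TOY_DICT`, the code characters, and the function names are hypothetical and chosen only for illustration, and the sketch assumes that the code characters never appear in the input SMILES. Each SMILES string is encoded independently and stays on its own line, which is what allows a single record to be decoded without touching the rest of the dataset.

```python
# Hypothetical toy dictionary: frequent SMILES substrings mapped to single
# code characters. This is an illustration, not the dictionary from the paper.
# Assumption: the code characters never occur in the input SMILES strings.
TOY_DICT = {
    "c1ccccc1": "!",  # benzene ring
    "C(=O)O": "?",    # carboxyl-like fragment
    "C(=O)": "~",     # carbonyl-like fragment
    "CC": "_",        # two aliphatic carbons
}
INVERSE = {code: pattern for pattern, code in TOY_DICT.items()}
# Try longer patterns first so the greedy match prefers them.
PATTERNS = sorted(TOY_DICT, key=len, reverse=True)


def compress_line(smiles: str) -> str:
    """Greedily replace dictionary patterns, scanning left to right."""
    out = []
    i = 0
    while i < len(smiles):
        for pattern in PATTERNS:
            if smiles.startswith(pattern, i):
                out.append(TOY_DICT[pattern])
                i += len(pattern)
                break
        else:
            # No pattern matches here: copy the character unchanged.
            out.append(smiles[i])
            i += 1
    return "".join(out)


def decompress_line(encoded: str) -> str:
    """Expand every code character back to its pattern."""
    return "".join(INVERSE.get(ch, ch) for ch in encoded)


def compress_dataset(lines):
    """Encode each record on its own, keeping one record per entry."""
    return [compress_line(s) for s in lines]


if __name__ == "__main__":
    dataset = ["c1ccccc1", "CCO", "CC(=O)Oc1ccccc1C(=O)O"]
    compressed = compress_dataset(dataset)
    # Random access: decode only the third record.
    print(decompress_line(compressed[2]))
```

Because no state is shared between records, sampling a small portion of a large dataset only requires locating and decoding the lines of interest. A left-to-right greedy match is the simplest selection strategy and is not guaranteed to give the shortest encoding.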
Taxonomy
Topics: Gene expression and cancer classification · Radiomics and Machine Learning in Medical Imaging
