nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder
Maksim Kuznetsov, Airat Valiev, Alex Aliper, Daniil Polykovskiy, Elena Tutubalina, Rim Shayakhmetov, Zulfat Miftahutdinov

TL;DR
nach0-pc is a multi-task language model with a molecular point cloud encoder for handling the 3D spatial structure of molecules. It targets drug discovery tasks while reducing training and inference time.
Contribution
The paper introduces nach0-pc, a novel multi-task language model with a molecular point cloud encoder and a new pre-training scheme for spatial molecular data.
Findings
Performance comparable to diffusion models in molecular generation tasks
Capable of multi-task learning and processing 3D molecular data
Reduced training and inference time while maintaining quality
Abstract
Recent advancements have integrated Language Models (LMs) into the drug discovery pipeline. However, existing models mostly work with SMILES and SELFIES chemical string representations, which lack spatial features vital for drug discovery. Additionally, attempts to translate chemical 3D structures into text format encounter issues such as excessive length and insufficient atom connectivity information. To address these issues, we introduce nach0-pc, a model combining a domain-specific encoder and a textual representation to handle the spatial arrangement of atoms effectively. Our approach utilizes a molecular point cloud encoder for concise and order-invariant structure representation. We introduce a novel pre-training scheme for molecular point clouds to distill knowledge from spatial molecular structure datasets. After fine-tuning within both single-task and multi-task frameworks,…
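To make "order-invariant structure representation" concrete, here is a minimal sketch of a permutation-invariant summary of a molecular point cloud. It is not the nach0-pc encoder, which the abstract does not specify in detail. The function name `encode_point_cloud`, the `WATER` coordinates, and the chosen features (element counts and sorted distances to the centroid) are all assumptions for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy molecule as (element, x, y, z) tuples, in angstroms.
# The coordinates are illustrative and are not taken from the paper.
WATER = [
    ("O", 0.000, 0.000, 0.117),
    ("H", 0.000, 0.757, -0.471),
    ("H", 0.000, -0.757, -0.471),
]


def encode_point_cloud(atoms):
    """Return a summary that does not depend on the order of the atoms.

    Assumed toy features: sorted element counts plus the sorted list of
    atom-to-centroid distances. A learned encoder would replace these
    hand-written features with trained ones.
    """
    n = len(atoms)
    centroid = (
        sum(a[1] for a in atoms) / n,
        sum(a[2] for a in atoms) / n,
        sum(a[3] for a in atoms) / n,
    )
    # Counting elements discards atom order by construction.
    counts = tuple(sorted(Counter(a[0] for a in atoms).items()))
    # Sorting the distances removes any dependence on input order;
    # rounding absorbs tiny floating-point differences in the centroid.
    radii = tuple(sorted(round(math.dist(a[1:], centroid), 6) for a in atoms))
    return counts, radii


if __name__ == "__main__":
    original = encode_point_cloud(WATER)
    shuffled = encode_point_cloud(list(reversed(WATER)))
    print(original == shuffled)
```

Listing the same atoms in a different order yields the same summary. A text serialization of 3D coordinates lacks this property, because its token sequence changes with atom order.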
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling
Methods: Diffusion
