Thinking like a CHEMIST: Combined Heterogeneous Embedding Model Integrating Structure and Tokens

Nikolai Rekut; Alexey Orlov; Klea Ziu; Elizaveta Starykh; Martin Takac; Aleksandr Beznosikov

arXiv:2502.17986·cs.LG·May 27, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a novel molecular representation combining substructure descriptors with language and graph models, improving performance in chemical prediction tasks.

Contribution

It proposes a combined heterogeneous embedding model that integrates detailed substructure descriptors with language and graph-based models for chemistry applications.

Findings

01. Improved QSAR prediction accuracy.

02. Enhanced molecular representation capturing chemical details.

03. Effective integration of substructure descriptors with models.

Abstract

Representing molecular structures effectively in chemistry remains a challenging task. Language models and graph-based models are extensively utilized within this domain, consistently achieving state-of-the-art results across an array of tasks. However, the prevailing practice of representing chemical compounds in the SMILES format - used by most data sets and many language models - presents notable limitations as a training data format. In this study, we present a novel approach that decomposes molecules into substructures and computes descriptor-based representations for these fragments, providing more detailed and chemically relevant input for model training. We use this substructure and descriptor data as input for a language model and also propose a bimodal architecture that integrates this language model with graph-based models. As the LM we use RoBERTa, Graph Isomorphism Networks…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper presents a clear methodological pipeline that bridges chemical descriptors with transformer-based and graph-based architectures, emphasizing interpretability and domain knowledge. The BRICS-based decomposition is a reasonable choice for fragment-level modeling, offering a structured alternative to SMILES-based tokenization. The contrastive learning framework for aligning substructure- and graph-level embeddings is technically well-motivated. Performance improvements on several benchmar
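To make the contrastive alignment mentioned in this review concrete, here is a minimal sketch in plain Python. It assumes a symmetric InfoNCE-style objective over paired fragment-level and graph-level embeddings; the excerpt does not say which loss the authors use, so the loss form, the `temperature` value, and the names `info_nce`, `frag_emb`, and `graph_emb` are all illustrative assumptions rather than the paper's method.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(frag_emb, graph_emb, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed objective, not confirmed by the source).

    frag_emb[i] and graph_emb[i] are treated as two views of the same
    molecule; every other pairing in the batch acts as a negative.
    """
    if len(frag_emb) != len(graph_emb):
        raise ValueError("both views need the same batch size")
    a = [_normalize(v) for v in frag_emb]
    b = [_normalize(v) for v in graph_emb]
    # Temperature-scaled cosine similarity between every cross-view pair.
    sim = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    sim_t = [list(col) for col in zip(*sim)]  # graph-to-fragment direction
    return 0.5 * (_diag_cross_entropy(sim) + _diag_cross_entropy(sim_t))


# Toy batch of two molecules with 2-dimensional embeddings (made-up values).
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each fragment-level embedding matches its own graph-level embedding, and large when the pairing is swapped. That gap is the signal a contrastive objective of this kind would use to pull the two modalities together.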

Weaknesses

Despite its clarity, the paper lacks true novelty and broader experimental support. The approach essentially reuses established components (RoBERTa, GCN/GIN/Graphormer, BRICS fragmentation, and descriptor-based features) and combines them without demonstrating a clear new principle or theoretical insight. The claim of “thinking like a chemist” remains largely rhetorical—there is no evidence that the model captures reasoning-like processes, causal relations, or interpretable chemistry. Moreover,

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The method demonstrates strong performance on downstream molecular property prediction tasks.

Weaknesses

Using “Thinking Like a Chemist” in your title naturally sets the expectation that the model can reason about chemistry, make decisions, or mimic a chemist’s problem-solving process (e.g., predicting reaction outcomes, designing new molecules intelligently, or explaining chemical phenomena). Your paper is about learning molecular representations, so the title is misleading: representation learning alone does not involve reasoning. The approach of fragmenting molecules using BRICS and repre

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The idea of using BRICS fragmentation to create a "chemical vocabulary" and then describing each "word" (substructure) with a rich set of descriptors is creative and well-motivated. The argument for aligning the model's "thinking" with a chemist's fragment-based reasoning is compelling.
2. The paper is generally well-written and clear. The figures effectively illustrate the overall architecture and key processes like tokenization and graph augmentation. The methodology is explained in a logic

Weaknesses

1. A significant weakness is the lack of discussion and comparison with other recent multi-modal molecular models. The related work section and experiments focus on unimodal (SMILES-based LMs or GNNs) and simpler bimodal (SMILES+Graph) models. However, several advanced multi-modal frameworks have been proposed that also aim to fuse different molecular perspectives. Notably:
   a. MoleculeSTM (Liu et al., Nature Machine Intelligence 2023) is a multi-modal model that aligns molecular structures with

Reviewer 04 · Rating 2 · Confidence 4

Strengths

This paper is well written and easy to understand.

Weaknesses

1. Insufficient Discussion of Related Work: The paper does not adequately situate itself within the growing field of fragment- or motif-based molecular pre-training. There is no discussion or comparison with models like FineMolTex [1], MoleculeSTM [2], and MolCA [3], which have a very similar bimodal design. This omission weakens the claim of novelty.
2. Narrow Scope of Evaluation: The experimental validation is limited to standard property prediction tasks (QSAR). For a bimodal mode

Taxonomy

Topics: Genetics, Bioinformatics, and Biomedical Research

Methods: Attention Is All You Need · Adam · Softmax · Dropout · Weight Decay · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization · Residual Connection