Data-Efficient Molecular Generation with Hierarchical Textual Inversion

Seojin Kim; Jaehyun Nam; Sihyun Yu; Younghoon Shin; Jinwoo Shin

arXiv:2405.02845 · cs.LG · July 17, 2024

TL;DR

HI-Mol introduces hierarchical textual inversion for data-efficient molecular generation. It uses multi-level embeddings to model molecule distributions from limited data and outperforms prior methods on QM9 while using 50x less data.

Contribution

The paper proposes a hierarchical textual inversion technique for molecular generation that captures multi-level features, enabling high-quality molecule generation from limited data.

Findings

1. HI-Mol outperforms previous methods with 50x less data on QM9.
2. Multi-level embeddings improve low-shot molecule distribution learning.
3. Generated molecules are effective for low-shot property prediction.

Abstract

Developing an effective molecular generation framework even with a limited number of molecules is often important for its practical deployment, e.g., drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experimental costs. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution. We propose to use multi-level embeddings to reflect such hierarchical features based on the adoption of the recent textual inversion technique in the visual domain, which achieves data-efficient image generation. Compared to the conventional textual inversion method in the image domain using a single-level token…
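The multi-level idea in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation. It assumes three token levels (one shared token, a small bank of intermediate tokens, and one detail token per molecule) and assumes that each molecule picks the intermediate token giving the lowest loss. The names `build_prompt`, `select_intermediate`, and `toy_loss` are hypothetical, and `toy_loss` merely stands in for the loss of a molecular language model.

```python
import random

def make_embedding(dim, rng):
    """Draw a small random vector to stand in for a learnable token embedding."""
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]

def build_prompt(shared, intermediate, detail):
    """Stack coarse-to-fine token embeddings into one prompt (a list of vectors)."""
    return [shared, intermediate, detail]

def toy_loss(prompt, target):
    """Hypothetical stand-in loss: squared distance between the mean prompt vector and a target."""
    dim = len(target)
    mean = [sum(vec[i] for vec in prompt) / len(prompt) for i in range(dim)]
    return sum((m - t) ** 2 for m, t in zip(mean, target))

def select_intermediate(shared, bank, detail, target):
    """Assumed selection rule: return the index of the intermediate token with the lowest loss."""
    losses = [toy_loss(build_prompt(shared, tok, detail), target) for tok in bank]
    return min(range(len(bank)), key=losses.__getitem__)

rng = random.Random(0)
DIM, NUM_MOLECULES, NUM_INTERMEDIATE = 8, 4, 3

shared = make_embedding(DIM, rng)                                # one token for the whole dataset
bank = [make_embedding(DIM, rng) for _ in range(NUM_INTERMEDIATE)]  # cluster-level tokens
details = [make_embedding(DIM, rng) for _ in range(NUM_MOLECULES)]  # one token per molecule
targets = [make_embedding(DIM, rng) for _ in range(NUM_MOLECULES)]  # toy per-molecule targets

# Each molecule is assigned the intermediate token that fits it best.
assignments = [
    select_intermediate(shared, bank, details[n], targets[n])
    for n in range(NUM_MOLECULES)
]
```

In contrast, a single-level variant would keep only one learnable token for the whole dataset, so there would be no per-cluster or per-molecule embeddings to distinguish coarse from fine features.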

Peer Reviews

Decision · ICML 2024 Poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The tackled problem is both intriguing and holds practical significance.
2. The paper is articulate and systematically presented.
3. The introduction of multi-level token embeddings enhances the textual inversion model.
4. Very strong experiments, which clearly show the superiority of the proposed method.

Weaknesses

1. The main concern I have with this paper is its novelty. While the ideas of multi-level molecule representation and embedding interpolation are well-established in the field, the authors merely integrate them into the newly introduced textual inversion framework. This casts doubts over the paper's genuine novelty and the depth of its technical contribution.
2. The rationale for adopting the textual inversion model appears somewhat nebulous. In my understanding, compared to SMILES, graph repre

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The problem they are trying to tackle is important and interesting.
- The idea looks relatively novel and justified since molecules are constructed of similar smaller components.
- The empirical results are promising.

Weaknesses

- The method is not described clearly and in detail. For instance, in the following paragraph of Eq. 1, it is mentioned that the intermediate tokens are "selected" during training. This is unclear and should be discussed in more detail.
- Figure 1 is not expressive enough to outline the method.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Introducing the successful textual inversion methods from the computer vision area into the molecular generation area is a good idea.
2. The experimental results presented in the paper demonstrate the effectiveness of the proposed method.

Weaknesses

1. The authors should have a clearer motivation figure in the introduction, which could be specific examples of molecules, to demonstrate that the highly complicated and structured nature of molecules makes it difficult to apply textual inversion directly.
2. The Molecular language model part in the Section 3.2 Preliminaries should be moved to the Related Work section.
3. Table 2 should also show the results of HI-Mol without grammar.
4. In Table 6, Valid decreases as the token hierarchical leve

Code & Models

Repositories

seojin-kim/hi-mol · PyTorch · Official

Taxonomy

Topics: Genetics, Bioinformatics, and Biomedical Research · Monoclonal and Polyclonal Antibodies Research · Chemical Synthesis and Analysis