Investigation of chemical structure recognition by encoder-decoder models in learning progress
Shumpei Nemoto, Tadahaya Mizuno, Hiroyuki Kusuhara

TL;DR
This study investigates how encoder-decoder models recognize chemical structures during learning, revealing early substructure learning and challenges in accurate structure restoration, thus providing new insights into model evaluation and structure understanding.
Contribution
This work is the first to analyze the relationship between learning progress and chemical structure recognition in encoder-decoder models that take SMILES representations as input.
Findings
Substructures are learned early in training.
Existing evaluation metrics may not fully capture structure learning.
Structure restoration is time-consuming to learn, and restored structures can be overestimated.
Abstract
Descriptor generation methods using latent representations of encoder-decoder (ED) models with SMILES as input are useful because of the continuity of descriptor and restorability to the structure. However, it is not clear how the structure is recognized in the learning progress of ED models. In this work, we created ED models of various learning progress and investigated the relationship between structural information and learning progress. We showed that compound substructures were learned early in ED models by monitoring the accuracy of downstream tasks and input-output substructure similarity using substructure-based descriptors, which suggests that existing evaluation methods based on the accuracy of downstream tasks may not be sensitive enough to evaluate the performance of ED models with SMILES as descriptor generation methods. On the other hand, we showed that structure…
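The two kinds of measurement mentioned in the abstract, input-output substructure similarity and structure restoration, can be illustrated with a minimal sketch. It is not the authors' code. It assumes that each molecule's substructure-based descriptor is available as a set of on-bit indices, that similarity is measured with the Tanimoto coefficient, and that restoration is scored by exact string match of canonicalized SMILES. The function names and the toy data are hypothetical.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty descriptors are identical
    return len(a & b) / len(a | b)


def mean_substructure_similarity(
    input_fps: Sequence[Set[int]], output_fps: Sequence[Set[int]]
) -> float:
    """Average input-output similarity over paired descriptors."""
    pairs = list(zip(input_fps, output_fps))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def exact_restoration_rate(
    input_smiles: Sequence[str], output_smiles: Sequence[str]
) -> float:
    """Fraction of decoded SMILES identical to the input.

    Assumes both lists were canonicalized beforehand, since plain string
    equality does not recognize two different spellings of one molecule.
    """
    pairs = list(zip(input_smiles, output_smiles))
    return sum(s == t for s, t in pairs) / len(pairs)


# Toy data, invented for illustration only.
input_fps = [{1, 2, 3, 4}, {2, 5}]
output_fps = [{1, 2, 3, 4, 7}, {2, 5}]  # first output has one extra substructure bit
input_smiles = ["CCO", "c1ccccc1"]
output_smiles = ["CCCO", "c1ccccc1"]  # first output is larger than its input

similarity = mean_substructure_similarity(input_fps, output_fps)
restoration = exact_restoration_rate(input_smiles, output_smiles)
print(f"mean substructure similarity: {similarity:.2f}")
print(f"exact restoration rate: {restoration:.2f}")
```

Evaluating both numbers at several training checkpoints would show whether similarity saturates before exact restoration does, which is the pattern the findings describe.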
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods
