Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Jiayun Pang; Ivan Vulić

arXiv:2405.10625·cs.CL·May 20, 2024

Open Access

TL;DR

This study evaluates the effectiveness of language-pretrained models FlanT5 and ByT5, fine-tuned for organic reaction prediction, demonstrating that pretraining on language data alone can be sufficient for chemistry tasks.

Contribution

The paper systematically investigates how language models pretrained on general text can be adapted to chemical reaction prediction, covering tokenisation, SMILES-oriented pretraining, fine-tuning sample efficiency, and decoding algorithms at inference.

Findings

01

Language models pretrained on text can be effectively fine-tuned for chemistry.

02

Tokenisation and vocabulary trimming influence performance and speed.

03

Simple greedy decoding performs competitively with more complex algorithms.
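To make the decoding comparison in finding 03 concrete, here is a minimal sketch of greedy decoding next to beam search. The `TOY_MODEL` table, its probabilities, and the function names are invented for illustration; they are not the paper's models or results.

```python
import math

# Hypothetical toy "model": maps a token prefix to next-token probabilities.
# All numbers are made up for illustration and do not come from the paper.
TOY_MODEL = {
    (): {"C": 0.7, "O": 0.3},
    ("C",): {"C": 0.6, "O": 0.2, "<eos>": 0.2},
    ("O",): {"C": 0.6, "<eos>": 0.4},
    ("C", "C"): {"O": 0.6, "<eos>": 0.4},
}

def next_probs(prefix):
    # Unknown prefixes simply end the sequence.
    return TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})

def greedy_decode(max_len=10):
    """Pick the single most probable token at every step."""
    seq, logp = [], 0.0
    for _ in range(max_len):
        probs = next_probs(seq)
        tok = max(probs, key=probs.get)
        logp += math.log(probs[tok])
        if tok == "<eos>":
            break
        seq.append(tok)
    return "".join(seq), logp

def beam_search(beam_width=3, max_len=10):
    """Keep the beam_width best partial sequences by cumulative log-probability."""
    beams, finished = [([], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in next_probs(seq).items():
                scored = (seq + [tok], logp + math.log(p))
                if tok == "<eos>":
                    finished.append((seq, scored[1]))
                else:
                    candidates.append(scored)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda c: c[1], reverse=True)
    return [("".join(s), lp) for s, lp in finished[:beam_width]]

if __name__ == "__main__":
    print("greedy:", greedy_decode())
    print("beam:  ", beam_search(beam_width=3))
```

On this toy table the two procedures happen to agree on the top sequence, purely by construction. Greedy decoding runs one model call per step, while beam search scores up to `beam_width` prefixes per step, which is where the cost difference comes from.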

Abstract

Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become…
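As a rough illustration of the tokenisation question raised in the abstract, the sketch below encodes a SMILES string at the byte level and, for contrast, with a simple atom-level regular expression. It assumes the ByT5-style convention of mapping each UTF-8 byte to `byte + 3` and appending an end-of-sequence id of 1; the regular expression is a simplified, hypothetical pattern and is not the tokeniser used in the paper.

```python
import re

# Assumption: ids 0, 1, 2 are reserved for pad, eos, unk (ByT5-style convention).
BYTE_OFFSET = 3
EOS_ID = 1

def byte_level_encode(smiles):
    """Map every UTF-8 byte to an id and append the end-of-sequence id."""
    return [b + BYTE_OFFSET for b in smiles.encode("utf-8")] + [EOS_ID]

def byte_level_decode(ids):
    """Invert byte_level_encode, skipping the reserved special ids."""
    return bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET).decode("utf-8")

# Simplified, hypothetical atom-level pattern: bracket atoms, two-letter
# halogens, single letters, ring-closure digits, and bond/branch symbols.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[()=#+\-/\\.@%:]")

def atom_level_tokenise(smiles):
    return ATOM_PATTERN.findall(smiles)

if __name__ == "__main__":
    smiles = "CC(=O)Cl"  # acetyl chloride
    ids = byte_level_encode(smiles)
    print(ids, len(ids))
    print(atom_level_tokenise(smiles))
    print(byte_level_decode(ids))
```

Byte-level encoding needs no chemistry-specific vocabulary but yields longer sequences (here "Cl" costs two ids rather than one token), which is one way tokenisation can affect speed.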

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling
