Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction
Jiayun Pang, Ivan Vulić

TL;DR
This study evaluates FlanT5 and ByT5, encoder-decoder models pretrained only on language data, after task-specific fine-tuning for organic reaction prediction, and finds that language-only pretraining provides a solid foundation for this chemistry task.
Contribution
A systematic empirical study of how language models pretrained on general text can be adapted to chemical reaction prediction, covering tokenisation, the impact of SMILES-oriented pretraining, fine-tuning sample efficiency, and decoding algorithms at inference.
Findings
Language models pretrained on text can be effectively fine-tuned for chemistry.
Tokenisation and vocabulary trimming influence performance and speed.
Simple greedy decoding performs competitively with more complex algorithms.
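To make the decoding finding concrete, the sketch below contrasts greedy decoding with a simplified beam search on a toy next-token model. Everything in it is hypothetical and for illustration only: the probability table, the function names (toy_step, greedy_decode, beam_search) and the token set are assumptions, not the models, data, or code used in the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"
StepFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def toy_step(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities, keyed by the prefix so far."""
    table = {
        (): {"C": math.log(0.6), "O": math.log(0.4)},
        ("C",): {"C": math.log(0.55), EOS: math.log(0.45)},
        ("C", "C"): {EOS: math.log(1.0)},
        ("O",): {EOS: math.log(0.9), "C": math.log(0.1)},
        ("O", "C"): {EOS: math.log(1.0)},
    }
    return table[prefix]


def greedy_decode(step: StepFn, max_len: int = 10) -> Tuple[List[str], float]:
    """Pick the single most probable token at every step."""
    prefix: Tuple[str, ...] = ()
    total = 0.0
    for _ in range(max_len):
        token, logprob = max(step(prefix).items(), key=lambda kv: kv[1])
        total += logprob
        if token == EOS:
            break
        prefix += (token,)
    return list(prefix), total


def beam_search(step: StepFn, beam_width: int = 2, max_len: int = 10) -> Tuple[List[str], float]:
    """Simplified beam search: keep the beam_width best unfinished prefixes.

    Finished hypotheses are collected separately and do not occupy beam slots.
    """
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logprob in step(prefix).items():
                if token == EOS:
                    finished.append((prefix, score + logprob))
                else:
                    candidates.append((prefix + (token,), score + logprob))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    best_prefix, best_score = max(finished, key=lambda f: f[1])
    return list(best_prefix), best_score


greedy_tokens, greedy_score = greedy_decode(toy_step)
beam_tokens, beam_score = beam_search(toy_step, beam_width=2)
print("greedy:", "".join(greedy_tokens), round(math.exp(greedy_score), 2))
print("beam:  ", "".join(beam_tokens), round(math.exp(beam_score), 2))
```

In this toy table the two strategies disagree, which is the case beam search is designed for. The finding above is that, on the reaction prediction task studied, greedy decoding remains competitive while needing only one forward pass per generated token.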
Abstract
Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become…
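As an illustration of the tokenisation question, the sketch below shows what byte-level tokenisation of a SMILES string can look like. It assumes the ID layout commonly described for ByT5 (three reserved IDs for padding, end-of-sequence and unknown, followed by the 256 byte values shifted by an offset of 3). The function names and constants are hypothetical, and this is not the tokeniser code used in the paper.

```python
from typing import List

# Assumed ByT5-style layout: 0 = pad, 1 = end-of-sequence, 2 = unknown,
# then byte value b is mapped to ID b + 3.
EOS_ID = 1
BYTE_OFFSET = 3


def encode_smiles_bytes(smiles: str, add_eos: bool = True) -> List[int]:
    """Map each UTF-8 byte of the SMILES string to an integer ID."""
    ids = [b + BYTE_OFFSET for b in smiles.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode_smiles_bytes(ids: List[int]) -> str:
    """Invert the mapping, skipping the reserved special IDs."""
    raw = bytes(i - BYTE_OFFSET for i in ids if i >= BYTE_OFFSET)
    return raw.decode("utf-8")


ethanol = "CCO"
ids = encode_smiles_bytes(ethanol)
print(ids)
print(decode_smiles_bytes(ids))
```

Because every byte becomes its own token, no SMILES-specific vocabulary is needed, but multi-character atoms such as Cl and Br take two tokens each, so byte-level sequences are longer than those of a tokeniser that treats each atom as one symbol.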
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling
