GraphBPE: Molecular Graphs Meet Byte-Pair Encoding
Yuchen Shen, Barnabás Póczos

TL;DR
GraphBPE introduces a novel graph tokenization method inspired by Byte-Pair Encoding, improving molecular graph data preprocessing and enhancing model performance across various datasets and architectures.
Contribution
This paper presents GraphBPE, a new substructure tokenization technique for molecular graphs that is model-agnostic and improves data preprocessing in molecular machine learning.
Findings
GraphBPE boosts performance on small classification datasets.
It performs comparably with other tokenization methods across architectures.
Data preprocessing significantly impacts molecular graph model performance.
Abstract
With the increasing attention to molecular machine learning, various innovations have been made in designing better models or proposing more comprehensive benchmarks. However, less is studied on the data preprocessing schedule for molecular graphs, where a different view of the molecular graph could potentially boost the model's performance. Inspired by the Byte-Pair Encoding (BPE) algorithm, a subword tokenization method popularly adopted in Natural Language Processing, we propose GraphBPE, which tokenizes a molecular graph into different substructures and acts as a preprocessing schedule independent of the model architectures. Our experiments on 3 graph-level classification and 3 graph-level regression datasets show that data preprocessing could boost the performance of models for molecular graphs, and GraphBPE is effective for small classification datasets and it performs on par with…
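To make the BPE analogy concrete, here is a minimal sketch of one BPE-style merge loop on a labeled graph. It is an illustration only and not the authors' GraphBPE algorithm, whose details are not given in the abstract. The graph representation (a dict of node labels plus an edge list), the function names, the tie-breaking rule, and the `(a-b)` token format are all assumptions made for this example. Counts are taken over a single graph here, whereas text BPE counts pairs over a whole corpus.

```python
from collections import Counter


def pair_key(labels, u, v):
    """Unordered label pair for the edge (u, v)."""
    return tuple(sorted((labels[u], labels[v])))


def most_frequent_pair(labels, edges):
    """Return the most common label pair over all edges, or None if no edges.

    Ties are broken by the lexicographically smallest pair (an assumption).
    """
    counts = Counter(pair_key(labels, u, v) for u, v in edges)
    if not counts:
        return None
    return min(counts, key=lambda p: (-counts[p], p))


def merge_pair(labels, edges, pair):
    """Contract non-overlapping edges whose endpoint labels match `pair`."""
    absorbed = {}  # absorbed node -> surviving node
    used = set()   # nodes already taking part in a merge this round
    for u, v in sorted(edges):
        if u in used or v in used:
            continue
        if pair_key(labels, u, v) == pair:
            absorbed[v] = u
            used.update((u, v))

    # Hypothetical token format for the merged substructure.
    token = "(" + pair[0] + "-" + pair[1] + ")"
    new_labels = {n: lab for n, lab in labels.items() if n not in absorbed}
    for survivor in absorbed.values():
        new_labels[survivor] = token

    # Re-route edges to surviving nodes; drop self-loops and duplicates.
    new_edges = set()
    for u, v in edges:
        u, v = absorbed.get(u, u), absorbed.get(v, v)
        if u != v:
            new_edges.add((min(u, v), max(u, v)))
    return new_labels, sorted(new_edges)


def tokenize(labels, edges, num_merges):
    """Apply up to `num_merges` rounds of most-frequent-pair merging."""
    for _ in range(num_merges):
        pair = most_frequent_pair(labels, edges)
        if pair is None:
            break
        labels, edges = merge_pair(labels, edges, pair)
    return labels, edges


# Toy heavy-atom skeleton: C-C with two O neighbors on the second carbon.
labels = {0: "C", 1: "C", 2: "O", 3: "O"}
edges = [(0, 1), (1, 2), (1, 3)]
print(tokenize(labels, edges, 1))
# ({0: 'C', 1: '(C-O)', 3: 'O'}, [(0, 1), (1, 3)])
```

In the toy example the pair (C, O) occurs on two edges and (C, C) on one, so the first round contracts one C-O edge into a single `(C-O)` node. Because the result is still a plain labeled graph, such a step could in principle run as preprocessing before any graph model, which is the sense in which the abstract calls the method independent of model architecture.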
Taxonomy
Topics: DNA and Biological Computing · Gene expression and cancer classification · Advanced biosensing and bioanalysis techniques
Methods: Softmax · Attention Is All You Need
