Graph Tokenization for Bridging Graphs and Transformers
Zeyuan Guo, Enmao Diao, Cheng Yang, Chuan Shi

TL;DR
This paper introduces a graph tokenization method that converts graphs into token sequences, so that standard Transformers can be applied directly to graph data and achieve state-of-the-art results without any change to the model architecture.
Contribution
The work presents a graph tokenization framework combining reversible serialization and BPE, effectively bridging graph data with sequence models like Transformers.
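As a rough illustration of the BPE half of this framework, the sketch below learns merge rules over symbol sequences of the kind a graph serializer might emit. It is a generic BPE loop, not the authors' implementation: the function names `learn_bpe` and `merge_pair`, the "+" joiner used for merged tokens, and the toy atom/bond sequences are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])  # "+" joiner is an assumption
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` BPE merges over lists of string symbols."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Deterministic choice: highest count first, then lexicographic order.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs


# Toy corpus: hypothetical serialized sequences of atom and bond symbols.
corpus = [["C", "-", "C", "-", "O"], ["C", "-", "C", "=", "O"]]
merges, encoded = learn_bpe(corpus, num_merges=5)
print(merges)   # [('C', '-')]
print(encoded)  # [['C+-', 'C+-', 'O'], ['C+-', 'C', '=', 'O']]
```

The pair that recurs across the corpus becomes a single token, which is how frequent substructures can end up as vocabulary entries once the serialization places them next to each other.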
Findings
Achieves state-of-the-art results on 14 graph benchmarks.
Enables direct use of Transformers on graph data without architectural changes.
Outperforms traditional graph neural networks and specialized graph transformers.
Abstract
The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without…
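To make the idea of statistics-guided serialization more concrete, here is a minimal sketch under stated assumptions. Graphs are given as a dict of node labels plus an undirected edge list, "substructures" are reduced to label pairs on edges, and the walk greedily follows the globally most frequent pattern. None of this is taken from the paper: the helper names, the "|" restart marker, and the greedy edge walk are hypothetical stand-ins, and this simplified walk emits labels only, so it is not guaranteed to be reversible the way the paper's serialization is described.

```python
from collections import Counter


def pattern(labels, u, v):
    """Undirected edge pattern: the sorted pair of endpoint labels."""
    return tuple(sorted((labels[u], labels[v])))


def global_pattern_counts(graphs):
    """Count edge patterns over a whole dataset of (labels, edges) graphs."""
    counts = Counter()
    for labels, edges in graphs:
        for u, v in edges:
            counts[pattern(labels, u, v)] += 1
    return counts


def serialize(labels, edges, counts):
    """Greedy edge walk that prefers globally frequent patterns (simple graphs assumed)."""
    unused = {frozenset(e) for e in edges}
    adj = {u: set() for u in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    tokens, current = [], None
    while unused:
        candidates = []
        if current is not None:
            candidates = [v for v in adj[current] if frozenset((current, v)) in unused]
        if not candidates:
            # Stuck (or just starting): jump to the smallest node id with an unused edge.
            current = min(u for e in unused for u in e)
            if tokens:
                tokens.append("|")  # hypothetical restart marker
            tokens.append(labels[current])
            continue
        # Deterministic choice: most frequent pattern, then label, then node id.
        nxt = min(candidates, key=lambda v: (-counts[pattern(labels, current, v)], labels[v], v))
        unused.remove(frozenset((current, nxt)))
        tokens.append(labels[nxt])
        current = nxt
    return tokens


# Two toy molecule-like graphs (hypothetical data).
g1 = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
g2 = ({0: "C", 1: "C", 2: "C", 3: "N"}, [(0, 1), (1, 2), (1, 3)])
counts = global_pattern_counts([g1, g2])
print(serialize(*g1, counts))  # ['C', 'C', 'O']
print(serialize(*g2, counts))  # ['C', 'C', 'C', '|', 'C', 'N']
```

In the second graph the walk takes the frequent C-C edge before the rarer C-N edge, so the common pattern appears as a contiguous run that a BPE pass could later merge.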
Peer Reviews
Decision: ICLR 2026 Poster
1. While the high-level idea is not new, the proposed work executes graph tokenization fairly well and shows good performance on the tasks it can handle. 2. In the evaluations, the proposed method shows good performance as well as efficiency.
The limitations on page 14 are very on point. 1. As limitations 1 and 2 mention, the focus on only graph-level tasks with discrete features significantly constrains the scope of this work. With such constraints, the proposed method almost only makes sense on protein and chemical graphs, where nodes are atoms, etc. 2. All three limitations kinda show a theme: this proposed work might not be very suitable for larger graphs such as social networks.
The proposed approach is elegant and cleanly integrates graph inputs with established LLM best practices. The combination of BPE with graph walks is clever, well-motivated, and useful for inputting graphs directly into Transformers, effectively sidestepping the need for specialized GNN architectures. The experiments appear strong and thorough. They demonstrate compelling compression rates from BPE and show that the method beats a variety of GNN models, although I note that I lack full context f
Regarding reversibility: Output sequences consist only of standard labels (Eq. 1), not unique node identifiers. In graphs with many identically labeled nodes (e.g., large carbon lattices), how does the decoder $f^{-1}$ explicitly distinguish between returning to a previously visited node versus arriving at a new node with the same label? Is reversibility guaranteed for all labeled graphs without positional markers? For one particular example, what structure can be accurately recovered from the m
Approach: The idea of combining reversible graph serialization with BPE, with formal analysis showing why existing methods fail to satisfy both reversibility and determinism is interesting. Empirical Performance: The method achieves state-of-the-art results on 12 out of 13 benchmarks and it also demonstrated that properly tokenized graphs enable standard Transformers to outperform specialized graph architectures without any modifications. Efficiency Gains: BPE compression achieves 6-10× reduct
Scope: The paper only evaluates graph-level classification and regression tasks. Despite claiming to "bridge graphs and transformers," node-level and edge-level prediction tasks are entirely absent from evaluation. The authors acknowledge this limitation (Appendix A.1) but provide no experimental validation of their proposed solutions. This significantly undermines the claim of providing a general framework for graph learning. Continuous Feature Problem: The framework fundamentally requires di
Taxonomy
Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Multimodal Machine Learning Applications
