ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink
Douglass Wang

TL;DR
ScribeTokens introduces a fixed-vocabulary tokenization method for digital ink that improves recognition accuracy and training efficiency, outperforming previous vector-based and token-based methods in handwritten text recognition tasks.
Contribution
The paper proposes ScribeTokens, a fixed-vocabulary tokenization for digital ink that improves recognition performance and training speed, and introduces next-ink-token prediction as a self-supervised pretraining strategy.
Findings
ScribeTokens outperforms vector representations in recognition accuracy.
Pretraining with next-ink-token prediction improves convergence and accuracy.
ScribeTokens achieves state-of-the-art results on the IAM and DeepWriting datasets.
Abstract
Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a…
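The unit-step decomposition described in the abstract can be illustrated with a rough sketch. This is not the paper's implementation. It assumes that the 10-token base vocabulary consists of eight one-pixel moves (the chain-code directions) plus the two pen-state tokens, that line segments are rasterized with Bresenham's algorithm, that y grows downward, and that the absolute start position is dropped. The token names and the functions `unit_steps`, `tokenize`, and `detokenize` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Stroke = List[Point]

# Hypothetical token names: 8 unit-step directions (y grows downward).
DIRECTIONS = {
    (1, 0): "E", (1, 1): "SE", (0, 1): "S", (-1, 1): "SW",
    (-1, 0): "W", (-1, -1): "NW", (0, -1): "N", (1, -1): "NE",
}
STEP_OF = {name: step for step, name in DIRECTIONS.items()}
# Hypothetical names for the two pen-state tokens.
PEN_DOWN, PEN_UP = "DOWN", "UP"


def unit_steps(start: Point, end: Point) -> List[Point]:
    """Rasterize the segment start -> end (Bresenham) as unit (dx, dy) steps."""
    x, y = start
    x1, y1 = end
    dx, dy = abs(x1 - x), abs(y1 - y)
    sx = 1 if x1 > x else -1
    sy = 1 if y1 > y else -1
    err = dx - dy
    steps = []
    while (x, y) != (x1, y1):
        e2 = 2 * err
        step_x = step_y = 0
        if e2 > -dy:
            err -= dy
            x += sx
            step_x = sx
        if e2 < dx:
            err += dx
            y += sy
            step_y = sy
        steps.append((step_x, step_y))
    return steps


def tokenize(strokes: List[Stroke]) -> List[str]:
    """Encode integer-pixel strokes with the 10-token base vocabulary."""
    tokens: List[str] = []
    cursor = strokes[0][0]  # assumption: the absolute start position is dropped
    for stroke in strokes:
        # Travel to the stroke start while the pen is lifted.
        tokens += [DIRECTIONS[s] for s in unit_steps(cursor, stroke[0])]
        tokens.append(PEN_DOWN)
        cursor = stroke[0]
        for point in stroke[1:]:
            tokens += [DIRECTIONS[s] for s in unit_steps(cursor, point)]
            cursor = point
        tokens.append(PEN_UP)
    return tokens


def detokenize(tokens: List[str], origin: Point = (0, 0)) -> List[Stroke]:
    """Rebuild pixel-level strokes by replaying the unit steps from origin."""
    x, y = origin
    strokes: List[Stroke] = []
    current = None
    for tok in tokens:
        if tok == PEN_DOWN:
            current = [(x, y)]
        elif tok == PEN_UP:
            strokes.append(current)
            current = None
        else:
            dx, dy = STEP_OF[tok]
            x, y = x + dx, y + dy
            if current is not None:
                current.append((x, y))
    return strokes


ink = [[(0, 0), (2, 0), (2, 2)], [(4, 2), (4, 0)]]
tokens = tokenize(ink)
print(tokens)
```

Because every coordinate delta reduces to these unit steps, no input can fall outside the vocabulary under this scheme. Straight segments produce runs of a repeated token, which is presumably the kind of redundancy that BPE merges can exploit.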
Taxonomy
Topics: Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis · Interactive and Immersive Displays
