See the Text: From Tokenization to Visual Reading
Ling Xing, Rui Yan, Alex Jinpeng Wang, Zechao Li, Jinhui Tang

TL;DR
This paper introduces SeeTok, a visual-text reading method for language models that interprets text as images, reducing token count and computational cost while improving robustness and cross-lingual performance.
Contribution
SeeTok presents a vision-based approach to text interpretation that challenges subword tokenization and leverages pretrained multimodal LLMs for more natural language understanding.
Findings
Matches or surpasses subword tokenizers in three language tasks.
Requires 4.43 times fewer tokens and reduces FLOPs by 70.5%.
Improves robustness to typographic noise and cross-lingual generalization.
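To make the two efficiency figures above concrete, here is a minimal arithmetic sketch. It only restates the reported numbers (4.43 times fewer tokens, 70.5% fewer FLOPs); the 1,000-token input and the function names are hypothetical and chosen for illustration, not taken from the paper.

```python
import math

# Reported headline numbers from the Findings list.
TOKEN_COMPRESSION = 4.43      # "4.43 times fewer tokens"
FLOP_REDUCTION_PCT = 70.5     # "reduces FLOPs by 70.5%"


def compressed_length(text_tokens: int, compression: float) -> int:
    """Token count implied by a compression ratio, rounded up."""
    return math.ceil(text_tokens / compression)


def implied_speedup(reduction_pct: float) -> float:
    """Compute-reduction factor implied by a percentage FLOP reduction."""
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction


def linear_reduction_pct(compression: float) -> float:
    """FLOP reduction one would see IF cost scaled linearly with token count."""
    return (1.0 - 1.0 / compression) * 100.0


# Hypothetical input: a prompt that costs 1,000 subword tokens.
example_tokens = compressed_length(1000, TOKEN_COMPRESSION)
speedup = implied_speedup(FLOP_REDUCTION_PCT)
linear_pct = linear_reduction_pct(TOKEN_COMPRESSION)

print(f"1000 subword tokens -> about {example_tokens} tokens")
print(f"70.5% fewer FLOPs   -> about {speedup:.2f}x less compute")
print(f"linear-in-tokens estimate: {linear_pct:.1f}% reduction")
```

Under a purely linear cost model the token reduction alone would suggest roughly a 77% FLOP saving, so the reported 70.5% is somewhat lower. One plausible reading, offered here as an assumption rather than a claim from the paper, is that encoding the rendered image adds some cost of its own.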
Abstract
People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks,…
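As a rough illustration of why rendering text as an image can shorten the input sequence, the sketch below estimates how many visual tokens a rendered text image would produce under a patch-based vision encoder. Every number here is an assumption made for illustration (fixed-width glyph cells of 7x14 pixels, 32 characters per line, 14-pixel patches, 2x2 patch merging), and the helper names are hypothetical; none of this is SeeTok's actual rendering or encoder configuration.

```python
import math


def rendered_size(text: str, chars_per_line: int = 32,
                  char_w: int = 7, line_h: int = 14) -> tuple[int, int]:
    """Pixel size of text laid out in fixed-width cells (assumed layout)."""
    n_chars = max(1, len(text))
    n_lines = math.ceil(n_chars / chars_per_line)
    width = min(n_chars, chars_per_line) * char_w
    height = n_lines * line_h
    return width, height


def visual_token_count(width_px: int, height_px: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Tokens after splitting into patches and merging merge x merge blocks."""
    patches_w = math.ceil(width_px / patch)
    patches_h = math.ceil(height_px / patch)
    merged_w = math.ceil(patches_w / merge)
    merged_h = math.ceil(patches_h / merge)
    return merged_w * merged_h


# Hypothetical 64-character input: two lines of 32 characters.
sample = "x" * 64
width, height = rendered_size(sample)
tokens = visual_token_count(width, height)
print(f"{len(sample)} chars -> {width}x{height} px -> {tokens} visual tokens")
```

The point of the sketch is only that the visual token count depends on the rendered area and the patch geometry rather than on a vocabulary, so it does not grow when a script is poorly covered by a subword vocabulary.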
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper presents a way to leverage visual tokenization for text processing. Leveraging pretrained MLLMs with LoRA adaptation is efficient and pragmatic, avoiding expensive training from scratch. 2. The proposed SEETOK demonstrates good performance on several categories of tasks, including QA, translation, and cross-lingual transfer, and also shows its robustness. 3. The proposed SEETOK tokenizer also shows superb efficiency in token compression and FLOP reduction.
1. Overall, it is interesting to see vision tokenization introduced as the tokenizer for text. However, the detokenization part still relies on the traditional text tokenizer and may still suffer from the existing issues of text tokenizers. This may be a weak point. So, how could a fully vision-centric tokenizer for text be implemented? 2. Table 1 shows SEETOK underperforms on MMLU (52.52 vs. 61.91) and NQ (24.14 vs. 29.31); this might be a significant limitation for knowledge-h
The idea of using pure visual tokens as input is well motivated. The experiments to reveal the benefit of using visual-centric tokenization are comprehensive. The Perturbation Probing study is an interesting investigation that reveals the unique advantage of holistic perception using visual tokens to represent text.
1. My major concern remains whether the performance gain comes from the finetuning process or SEETOK itself, especially considering the low performance on SST5 for the original Qwen2.5-VL 3B model. Although it’s discussed in Table 6, it would strengthen the paper’s conclusion to also show the evaluation results for all the datasets evaluated in Table 1 and Table 3, rather than only showing the results of MMLU. I would be more curious about the results of TriviaQA and SST5. 2. Line 429: “even t
1. Clear engineering contribution: The paper proposes a practical and well-motivated framework that treats rendered text as input for vision encoders in existing MLLMs. This vision-as-tokenizer approach is modular, minimally invasive, and compatible with off-the-shelf MLLMs, requiring only LoRA-based tuning without retraining the full model. 2. System-level benefits: The authors go beyond token count reduction to quantify end-to-end FLOPs and latency improvements (e.g., 70.5% FLOP reduction and
**Major Issues** 1. Incremental novelty: While the integration into MLLMs is elegant, the central idea, rendering text as images and processing it via vision encoders, has been extensively explored in prior work (e.g., PIXEL, CLIPPO, and CLIP-style MLLMs). Many of the claimed benefits (e.g., robustness to perturbation, fertility reduction, multilingual fairness) are inherent to this paradigm and not specific to SEETOK. The primary contribution lies in integrating this paradigm into general-purpose MLLMs with LoRA f
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
