Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

Yanhong Li; Zixuan Lan; Jiawei Zhou

arXiv:2510.18279·cs.CL·October 23, 2025

Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

Yanhong Li, Zixuan Lan, Jiawei Zhou

PDF

Open Access

TL;DR

This paper demonstrates that rendering long text inputs as images for multimodal large language models significantly reduces token usage by nearly half, while maintaining task performance across benchmarks.

Contribution

It introduces a novel input compression method by converting text to images for decoder LLMs, achieving substantial token savings without performance loss.

Findings

01

Nearly 50% token reduction in experiments

02

Effective on long-context retrieval and summarization tasks

03

Maintains performance with visual text inputs

Abstract

Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks RULER (long-context retrieval) and CNN/DailyMail (document summarization) we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques