Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Yanhong Li, Zixuan Lan, Jiawei Zhou

TL;DR
This paper demonstrates that rendering long text inputs as images for multimodal large language models significantly reduces token usage by nearly half, while maintaining task performance across benchmarks.
Contribution
It introduces a novel input compression method by converting text to images for decoder LLMs, achieving substantial token savings without performance loss.
Findings
Nearly 50% token reduction in experiments
Effective on long-context retrieval and summarization tasks
Maintains performance with visual text inputs
Abstract
Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and provide it directly to the model. This leads to dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks RULER (long-context retrieval) and CNN/DailyMail (document summarization) we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
