Global Context Compression with Interleaved Vision-Text Transformation

Dian Jiao; Jiaxin Duan; Shuai Zhao; Jiabing Leng; Yiran Zhang; Feng Huang

arXiv:2601.10378·cs.CV·January 21, 2026

TL;DR

This paper introduces VIST2, a Transformer that interleaves text chunks with their visual encodings to compress context globally, reducing computation and memory at both the prefilling and token-by-token inference stages.

Contribution

VIST2 is a novel Transformer architecture that interleaves visual encodings with text chunks, enabling effective global context compression during both prefilling and inference stages.

Findings

01. Achieves a 3× speedup in first-token generation

02. Reduces memory usage by 77%

03. Demonstrates superior performance on long writing tasks

Abstract

Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train…
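
The interleaving idea in the abstract can be illustrated with a toy attention mask. This is only a sketch under stated assumptions, not the authors' implementation: the helper names (`build_layout`, `can_attend`, `build_mask`), the ordering of each text chunk before its visual tokens, the rule that visual queries see only visual keys, and the chunk sizes (8 text tokens compressed into 2 visual tokens) are all hypothetical choices made for illustration. The one property taken from the abstract is that a text token relies exclusively on visual tokens for everything before its own chunk.

```python
def build_layout(num_chunks, text_len, vis_len):
    """Lay out positions as [T_0, V_0, T_1, V_1, ...].

    Each entry is (kind, chunk_index). Placing a chunk's text before its
    visual tokens is an assumption made for this sketch.
    """
    layout = []
    for chunk in range(num_chunks):
        layout += [("text", chunk)] * text_len
        layout += [("vis", chunk)] * vis_len
    return layout


def can_attend(q, k, layout):
    """Return True if the query at position q may attend to the key at k."""
    if k > q:
        return False  # causal: never look at later positions
    q_kind, q_chunk = layout[q]
    k_kind, k_chunk = layout[k]
    if k_kind == "vis":
        return True  # every earlier visual token stays visible
    # Text keys are visible only to text queries in the same chunk, so
    # earlier chunks are reachable through their visual tokens alone.
    return q_kind == "text" and k_chunk == q_chunk


def build_mask(layout):
    """Build the full boolean mask; mask[q][k] is True where attention is allowed."""
    n = len(layout)
    return [[can_attend(q, k, layout) for k in range(n)] for q in range(n)]


# Illustrative sizes only; the paper's actual chunk sizes are not given here.
layout = build_layout(num_chunks=4, text_len=8, vis_len=2)
mask = build_mask(layout)

last_text_pos = 3 * (8 + 2) + 7       # last text token of the fourth chunk
visible = sum(mask[last_text_pos])    # keys this token can attend to
full_text_context = 4 * 8             # keys under plain causal text attention
print(visible, full_text_context)
```

With these toy sizes, the last text token attends to 6 visual tokens from the three earlier chunks plus the 8 text tokens of its own chunk, instead of all 32 text tokens. Because text keys from finished chunks are never attended to again under this mask, they would not need to stay in the key-value cache, which is where a saving at inference time would come from.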

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques · Multimodal Machine Learning Applications