VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Yibo Wang; Yongcheng Jing; Shunyu Liu; Hao Guan; Rong-cheng Tu; Chengyu Wang; Jun Huang; Dacheng Tao

arXiv:2601.22069 · cs.CL · February 3, 2026



TL;DR

VTC-R1 is a vision-text compression method that renders intermediate reasoning segments into images and feeds them back into a vision-language model, making long-context reasoning more efficient, with a reported 2.7x inference speedup.

Contribution

The paper proposes a vision-text compression paradigm that improves reasoning efficiency by rendering intermediate reasoning steps into compact images, which the model then processes iteratively.

Findings

01. Achieves 3.4x token compression on a training dataset constructed from OpenR1-Math-220K.

02. Outperforms standard long-context reasoning on benchmarks such as MATH500.

03. Provides a 2.7x speedup in inference latency.

Abstract

Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K, achieving 3.4x token compression, and fine-tune two representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500,…
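The iterative "optical memory" loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Page`, `render`, `reason`, and `scripted_step` are hypothetical names, the "image" is only a record of token counts, tokens are approximated as whitespace-separated words, and the 3.4 ratio is plugged in as an assumed constant rather than measured. The sketch only shows the control flow (generate a segment, render it, feed earlier pages back) and how a compression ratio would be tallied.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a rendered image of one reasoning segment."""
    source_text: str
    text_tokens: int    # cost of the segment if kept as plain text
    vision_tokens: int  # assumed cost of the same segment as an image


def render(segment: str, assumed_ratio: float = 3.4) -> Page:
    """Pretend to render text into an image.

    Assumptions: a token is a whitespace-separated word, and the image
    costs text_tokens / assumed_ratio vision tokens, rounded up.
    """
    text_tokens = len(segment.split())
    vision_tokens = max(1, math.ceil(text_tokens / assumed_ratio))
    return Page(segment, text_tokens, vision_tokens)


# A step takes the question plus earlier pages and returns (segment, is_final).
Step = Callable[[str, List[Page]], Tuple[str, bool]]


def reason(question: str, step: Step,
           max_rounds: int = 8) -> Tuple[Optional[str], List[Page]]:
    """Generate segments one at a time, feeding earlier ones back as pages."""
    memory: List[Page] = []
    for _ in range(max_rounds):
        segment, is_final = step(question, memory)
        if is_final:
            return segment, memory
        memory.append(render(segment))
    return None, memory  # no final answer within the round budget


def scripted_step(question: str, memory: List[Page]) -> Tuple[str, bool]:
    """Toy model: replays a fixed script, one segment per call."""
    script = [
        "Let x be the unknown so that 2 x plus 3 equals 11",
        "Subtract 3 from both sides to get 2 x equals 8",
        "x = 4",
    ]
    index = len(memory)
    return script[index], index == len(script) - 1


answer, memory = reason("Solve 2x + 3 = 11", scripted_step)
text_total = sum(page.text_tokens for page in memory)
vision_total = sum(page.vision_tokens for page in memory)
print(answer, text_total, vision_total, text_total / vision_total)
```

In a real system the `step` callable would wrap a vision-language model call and `render` would produce an actual image; both are left abstract here because the page does not give those details.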

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling