OneLatent: Single-Token Compression for Visual Latent Reasoning
Bo Lv, Yasheng Sun, Junjie Wang, Haoxiang Shi

TL;DR
OneLatent introduces a method to compress intermediate reasoning steps into a single latent token using image rendering and OCR supervision, significantly reducing output length and inference cost while maintaining high accuracy.
Contribution
It proposes a novel single-token latent reasoning framework that leverages rendered images and OCR supervision to efficiently condense reasoning processes.
Findings
Reduces output length by 11 times with minimal accuracy loss
Achieves up to 87.4 times compression on reasoning tasks
Maintains high accuracy (over 97%) with single latent tokens on logical reasoning benchmarks
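As a rough illustration of how compression figures like those above could be computed: this summary does not define the metric, so the sketch below assumes compression is the textual CoT token count divided by the number of latent tokens emitted (one, for a single-token method). The function names and example lengths are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def compression_ratio(cot_token_count: int, latent_token_count: int = 1) -> float:
    """Ratio of textual CoT tokens to latent tokens (assumed definition)."""
    if cot_token_count < 0:
        raise ValueError("cot_token_count must be non-negative")
    if latent_token_count <= 0:
        raise ValueError("latent_token_count must be positive")
    return cot_token_count / latent_token_count


def mean_compression(cot_lengths: Sequence[int], latent_token_count: int = 1) -> float:
    """Average per-example compression ratio over a set of examples."""
    if not cot_lengths:
        raise ValueError("cot_lengths must not be empty")
    ratios = [compression_ratio(n, latent_token_count) for n in cot_lengths]
    return sum(ratios) / len(ratios)


# Illustrative lengths only: two rationales of 80 and 100 tokens,
# each replaced by a single latent token.
example_lengths = [80, 100]
average = mean_compression(example_lengths)
```

Under this assumed definition, a single latent token standing in for a longer rationale yields a proportionally higher ratio, which is why long-chain tasks would show the largest compression.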
Abstract
Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address these challenges, we present OneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by 11 times with only a small average accuracy drop relative to textual CoT, while improving output token contribution (OTC). On long-chain logical reasoning, OneLatent reaches over 97% accuracy on ProntoQA and ProsQA with one latent token, with compression up to 87.4 times, supporting…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
