TL;DR
This paper critically evaluates optical context compression, showing that simple direct methods outperform complex vision-based approaches in text reconstruction and language modeling tasks.
Contribution
It demonstrates that complex vision-based compression does not outperform simple baselines, challenging the hype around optical context compression.
Findings
Vision encoder does not outperform simple mean pooling or hierarchical encoding.
Direct methods match or surpass vision in reconstruction at all compression ratios.
Vision performs similarly to truncation in language modeling and does not surpass the best baseline.
Abstract
DeepSeek-OCR shows that rendered text can be reconstructed from a small number of vision tokens, sparking excitement about using vision as a compression medium for long textual contexts. But this pipeline requires rendering token embeddings to pixels and compressing from there -- discarding learned representations in favor of an image the vision encoder must then recover from. We ask whether this detour helps. Comparing DeepSeek-OCR's vision encoder against near-zero-parameter mean pooling and a learned hierarchical encoder, we find it does not. For reconstruction, simple direct methods match or surpass vision at every compression ratio. For language modeling, vision performs comparably to truncation -- a baseline that simply discards context -- and loses to the hierarchical encoder at every compression ratio. As expected, all compression methods outperform truncation for factual…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
