FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

TL;DR
FastOCR is a dynamic, training-free pruning framework that exploits the temporal sparsity of visual attention in OCR tasks to cut inference cost with minimal accuracy loss.
Contribution
FastOCR proposes a dynamic pruning method based on visual fixation patterns. It avoids irreversible token removal and enables efficient document parsing across multiple vision-language models.
Findings
Retains 98% accuracy while attending to only 5% of tokens.
Reduces attention latency by a factor of 3.0.
Generalizes across five different VLM architectures.
Abstract
Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page…
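The contrast between physical eviction and per-step fixation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_fixation` and `sparse_attention`, the `keep_ratio` parameter, and the top-k ranking by query-key score are all assumptions made for illustration, since the truncated abstract does not state the actual selection rule. The sketch only shows the semantics (the cache is never shrunk, and each decoding step attends to a small subset). It scores every key to pick that subset, so it does not itself reproduce the reported latency savings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_fixation(query, keys, keep_ratio=0.05):
    """Pick the visual tokens to attend to at the current decoding step.

    Assumption: rank cached keys by query-key score and keep the top
    keep_ratio fraction (at least one). The paper's criterion may differ.
    """
    k = max(1, round(keep_ratio * len(keys)))
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, keep_ratio=0.05):
    """Attend over the selected subset only; keys and values stay intact,
    so a later step can still select any token skipped at this step."""
    idx = select_fixation(query, keys, keep_ratio)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]
    return out, idx


# Toy cache: 20 visual tokens with one-hot keys (hypothetical data).
n = 20
keys = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
values = [[float(i)] for i in range(n)]

# The query drifts from token 3 to token 5, mimicking a shifting fixation.
for step in (3, 4, 5):
    query = [4.0 if j == step else 0.0 for j in range(n)]
    out, idx = sparse_attention(query, keys, values)
    print(step, idx, out)
```

In this toy setup each step selects the single token whose key matches the query, and the selected index moves with the query while the cache keeps all 20 entries. An eviction-based method that had discarded tokens 4 and 5 during prefill could not recover them at later steps.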
