HSD: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin

TL;DR
HSD is a hierarchical decoding framework that accelerates document parsing in vision-language models by predicting regions and verifying drafts in a two-stage process, significantly reducing inference time.
Contribution
The paper introduces Hierarchical Speculative Decoding (HSD), a novel two-stage local-to-global approach that improves inference speed while maintaining global coherence in document parsing.
Findings
Achieves up to 7.04x speedup on long-document parsing tasks.
Provides near-lossless 2.78x speedup with HunyuanOCR on OmniDocBench v1.5.
Effectively preserves full-page coherence through hierarchical verification.
Abstract
Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must autoregressively generate long, full-page sequences when processing long-form documents. While recent hybrid methods mitigate this issue via region-level parallel decoding with VLMs, independent region decoding loses full-page context and might weaken global coherence. To address this issue, we propose Hierarchical Speculative Decoding (HSD), a two-stage local-to-global framework for document parsing. HSD first employs a lightweight pipeline drafter to predict region…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
