Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs
Xinzhong Wang, Ya Guo, Jing Li, Huan Chen, Yi Tu, Yijie Hong, Gongshen Liu, Huijia Zhu

TL;DR
This paper introduces PIP, a parallel inference paradigm for key information extraction in multimodal large language models, achieving up to 36x speedup with minimal performance loss by generating multiple fields simultaneously.
Contribution
The paper proposes a novel mask-based parallel inference method and tailored training strategy, significantly improving efficiency for KIE tasks in MLLMs.
Findings
Achieves 5-36x inference speedup
Maintains high accuracy with negligible performance degradation
Enables scalable real-world KIE applications
Abstract
Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference…
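The placeholder idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`build_masked_template`, `parallel_fill`, `toy_predictor`), the fixed number of mask slots per field, and the `[pad]` filler are all assumptions made for illustration, and a lookup table stands in for the model. It only shows the shape of the approach: every field gets `[mask]` slots up front, and a single call fills all of them, whereas token-by-token decoding would need one step per generated token.

```python
from typing import Callable, Dict, List

MASK = "[mask]"  # placeholder token named in the abstract


def build_masked_template(fields: List[str], slots_per_field: int) -> List[str]:
    """Lay out each field name followed by a run of [mask] placeholders.

    A fixed slot count per field is an assumption of this sketch.
    """
    tokens: List[str] = []
    for name in fields:
        tokens.append(name + ":")
        tokens.extend([MASK] * slots_per_field)
    return tokens


def parallel_fill(tokens: List[str],
                  predict_all: Callable[[List[str]], List[str]]) -> List[str]:
    """Run one 'forward pass' and overwrite only the masked positions."""
    predictions = predict_all(tokens)  # one prediction per position
    return [p if t == MASK else t for t, p in zip(tokens, predictions)]


def sequential_step_count(fields: List[str], slots_per_field: int) -> int:
    """Decoding steps a token-by-token decoder would need for the same slots."""
    return len(fields) * slots_per_field


def toy_predictor(answers: Dict[str, List[str]]) -> Callable[[List[str]], List[str]]:
    """Hypothetical stand-in for a model: looks answers up by field name."""
    def predict_all(tokens: List[str]) -> List[str]:
        out: List[str] = []
        current, offset = "", 0
        for t in tokens:
            if t == MASK:
                values = answers[current]
                out.append(values[offset] if offset < len(values) else "[pad]")
                offset += 1
            else:
                # A field label such as "date:" starts a new run of slots.
                current, offset = t[:-1], 0
                out.append(t)
        return out
    return predict_all


fields = ["date", "total"]
template = build_masked_template(fields, slots_per_field=2)
filled = parallel_fill(
    template,
    toy_predictor({"date": ["2024-01-05"], "total": ["42", "USD"]}),
)
print(template)
print(filled)
print("sequential steps:", sequential_step_count(fields, 2), "vs parallel passes: 1")
```

In this toy setting the four slots are filled by one call instead of four sequential steps. The step ratio here is only an illustration and should not be read as the speedup the paper reports.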
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Handwritten Text Recognition Techniques
