Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie

TL;DR
This paper introduces a novel token-level correlation-guided compression method that adaptively reduces redundant tokens in multimodal document understanding, improving efficiency while maintaining performance.
Contribution
It proposes a parameter-free, plug-and-play compression technique based on token correlation and informativeness, enhancing efficiency in multimodal document understanding models.
Findings
Improves processing speed during training and inference.
Maintains comparable performance with reduced tokens.
Demonstrates effectiveness on the state-of-the-art mPLUG-DocOwl1.5 model.
Abstract
Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to perform document understanding. Most current document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their differing informativeness and leads to a significant increase in the number of image tokens. To enable more adaptive and efficient document understanding, we propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology for optimizing token processing. First, we propose an approach for assessing pattern repetitiveness based on the correlation between patch tokens. This method identifies redundant tokens, allowing the information density of each sub-image to be determined. Second, we present a token-level sampling method…
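The first step described in the abstract (scoring pattern repetitiveness from correlations between patch tokens, then deriving a sub-image's information density) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not state the correlation measure, the redundancy criterion, or how density is defined. The sketch assumes cosine similarity as the correlation, a hypothetical fixed threshold of 0.9, and density as the fraction of tokens not flagged as redundant. The function names are illustrative only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def redundant_mask(tokens: Sequence[Sequence[float]], threshold: float = 0.9) -> List[bool]:
    """Flag a token as redundant if it correlates strongly with any other token.

    ASSUMPTION: cosine similarity and a fixed threshold stand in for the
    paper's correlation-based criterion, which this excerpt does not specify.
    """
    n = len(tokens)
    mask = [False] * n
    for i in range(n):
        for j in range(n):
            if i != j and cosine(tokens[i], tokens[j]) > threshold:
                mask[i] = True
                break
    return mask


def information_density(tokens: Sequence[Sequence[float]], threshold: float = 0.9) -> float:
    """ASSUMED definition: fraction of tokens in a sub-image that are not redundant."""
    if not tokens:
        return 0.0
    mask = redundant_mask(tokens, threshold)
    return 1.0 - sum(mask) / len(mask)


# Toy sub-image with four 2-D "patch tokens": three near-duplicates, one distinct.
sub_image = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [1.0, 0.0]]
mask = redundant_mask(sub_image)
density = information_density(sub_image)
```

In this toy input, the three near-identical tokens are flagged and the distinct one is kept, so the sub-image gets a low density score. A sub-image dominated by repeated patterns (for example, blank page background) would score similarly low under these assumptions, which is the kind of signal a compression ratio could be conditioned on.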
Taxonomy
Topics: Advanced Data Compression Techniques · Algorithms and Data Compression · Speech Recognition and Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
