Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Renshan Zhang; Yibo Lyu; Rui Shao; Gongwei Chen; Weili Guan; Liqiang Nie

arXiv:2407.14439·cs.CV·July 22, 2024·2 cites

TL;DR

This paper introduces a novel token-level correlation-guided compression method that adaptively reduces redundant tokens in multimodal document understanding, improving efficiency while maintaining performance.

Contribution

It proposes a parameter-free, plug-and-play compression technique based on token correlation and informativeness, enhancing efficiency in multimodal document understanding models.

Findings

01. Improves processing speed during both training and inference.

02. Maintains comparable performance while processing fewer tokens.

03. Demonstrates effectiveness on the state-of-the-art mPLUG-DocOwl1.5 model.

Abstract

Cropping high-resolution document images into multiple sub-images is the most widely used approach for document understanding in current Multimodal Large Language Models (MLLMs). Most current document understanding methods preserve all tokens within the sub-images and treat them equally. This neglects their differing informativeness and leads to a significant increase in the number of image tokens. To make document understanding more adaptive and efficient, we propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing. Firstly, we propose an innovative approach for assessing pattern repetitiveness based on the correlation between patch tokens. This method identifies redundant tokens, allowing the sub-image's information density to be determined. Secondly, we present a token-level sampling method…
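
The first step described in the abstract (flagging redundant patch tokens by their mutual correlation, then deriving a sub-image's information density) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity as the correlation measure, the 0.9 threshold, the definition of density as the fraction of non-redundant tokens, and the function names `cosine`, `redundant_mask`, and `information_density` are all assumptions made for illustration. It uses only the Python standard library.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def redundant_mask(tokens: Sequence[Sequence[float]],
                   threshold: float = 0.9) -> List[bool]:
    """Flag a token as redundant if its similarity to any other token
    exceeds `threshold` (an assumed, illustrative criterion)."""
    n = len(tokens)
    mask = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(tokens[i], tokens[j]) > threshold:
                mask[i] = True
                mask[j] = True
    return mask


def information_density(tokens: Sequence[Sequence[float]],
                        threshold: float = 0.9) -> float:
    """Assumed definition: fraction of tokens that are NOT flagged redundant.
    An empty sub-image is treated as having density 0.0."""
    if not tokens:
        return 0.0
    mask = redundant_mask(tokens, threshold)
    return 1.0 - sum(mask) / len(mask)


# Toy sub-image with four 2-D "patch tokens": the first two are near-duplicates.
patch_tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
print(redundant_mask(patch_tokens))
print(information_density(patch_tokens))
```

Under these assumptions, a sub-image made mostly of repeated patterns (for example blank page margins) would receive a low density and could be compressed more aggressively than a text-dense one. The second, token-level sampling step is truncated in the abstract above and is therefore not sketched here.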

Code & Models

Repositories

JiuTian-VL/TokenCorrCompressor (Official)

Taxonomy

Topics: Advanced Data Compression Techniques · Algorithms and Data Compression · Speech Recognition and Synthesis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings