Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, and Xiang Bai

TL;DR
Mini-Monkey introduces a Complementary Image Pyramid and Scale Compression Mechanism to improve high-resolution image understanding in lightweight multimodal large language models, significantly enhancing performance with minimal computational cost.
Contribution
The paper proposes a novel Complementary Image Pyramid and Scale Compression Mechanism to mitigate semantic discontinuity in lightweight MLLMs during high-resolution image processing.
Findings
The Complementary Image Pyramid (CIP) improves performance across MLLMs of various architectures and capacities.
Mini-Monkey surpasses larger models on OCRBench.
Training Mini-Monkey is cost-effective, requiring only eight GPUs.
Abstract
Recently, scaling images to high resolution has received much attention in multimodal large language models (MLLMs). Most existing practices adopt a sliding-window-style cropping strategy to adapt to resolution increase. Such a cropping strategy, however, can easily cut off objects and connected regions, which introduces semantic discontinuity and therefore impedes MLLMs from recognizing small or irregularly shaped objects or text, leading to a phenomenon we call the semantic sawtooth effect. This effect is particularly evident in lightweight MLLMs. To address this issue, we introduce a Complementary Image Pyramid (CIP), a simple, effective, and plug-and-play solution designed to mitigate semantic discontinuity during high-resolution image processing. In particular, CIP dynamically constructs an image pyramid to provide complementary semantic information for the cropping-based MLLMs,…
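To make the cropping problem in the abstract concrete, the sketch below shows how a fixed grid of crops can split an object across a tile seam, and how views cropped on a different grid can keep the same object whole. It is only an illustration under stated assumptions: the helper names (`tile_boxes`, `is_cut`, `pyramid_views`), the image size, the object box, and the grid choices are all hypothetical, and the paper's actual rule for building the pyramid is not given in the excerpt above.

```python
def tile_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) boxes for a cols x rows grid of crops."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def is_cut(obj, boxes):
    """True if no single crop fully contains the object box."""
    l, t, r, b = obj
    return not any(
        bl <= l and bt <= t and r <= br and b <= bb
        for bl, bt, br, bb in boxes
    )


def pyramid_views(width, height, grids):
    """Build one set of crop boxes per (cols, rows) grid.

    Assumption: each grid stands in for one level of an image pyramid.
    """
    return {grid: tile_boxes(width, height, *grid) for grid in grids}


# Hypothetical 896 x 896 image with a small object straddling x = 448,
# which is the vertical seam of a 2 x 2 grid.
W, H = 896, 896
obj = (400, 100, 500, 150)

# Hypothetical grid choices: a 2 x 2 detail grid, a 3 x 1 grid whose seams
# fall elsewhere, and a 1 x 1 global view.
views = pyramid_views(W, H, [(2, 2), (3, 1), (1, 1)])
cut_by = {grid: is_cut(obj, boxes) for grid, boxes in views.items()}

for grid, cut in cut_by.items():
    print(grid, "cuts the object" if cut else "keeps the object whole")
```

In this toy setup the 2 x 2 grid splits the object while the 3 x 1 grid and the global view contain it, which is the kind of complementary coverage the abstract attributes to CIP.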
Taxonomy
Topics: Interactive and Immersive Displays · BIM and Construction Integration · Augmented Reality Applications
Methods: Softmax · Attention Is All You Need · Focus
