TL;DR
LLaVA-UHD v4 combines slice-based encoding with intra-ViT early compression to cut visual-encoding FLOPs by 55.8% in high-resolution MLLMs while maintaining or improving performance.
Contribution
The paper proposes intra-ViT early compression, which reduces visual tokens in shallow ViT layers, and pairs it with slice-based encoding to make high-resolution visual encoding in MLLMs more efficient.
Findings
- Reduces visual-encoding FLOPs by 55.8%
- Slice-based encoding outperforms global encoding across benchmarks
- Maintains or surpasses baseline performance
Abstract
Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers…
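The cost argument in the abstract can be made concrete with a back-of-the-envelope FLOPs estimate. The sketch below uses a standard approximation for one transformer layer and compares three setups: global encoding with post-ViT compression, slice-based encoding with post-ViT compression, and slice-based encoding with early compression. All constants (depth, width, token count, slice count, compression layer, and compression ratio) are hypothetical values chosen for illustration, not the paper's configuration, and the cost of the compression module itself is ignored, so the printed ratios are not expected to match the reported 55.8%.

```python
def layer_flops(n, d, mlp_ratio=4):
    """Approximate forward FLOPs of one ViT layer (1 multiply-add = 2 FLOPs)."""
    proj = 8 * n * d * d              # Q, K, V and output projections
    attn = 4 * n * n * d              # QK^T scores plus attention-weighted sum
    mlp = 4 * mlp_ratio * n * d * d   # two linear layers of the MLP block
    return proj + attn + mlp


def encoder_flops(tokens_per_layer, d):
    """Sum layer costs given the token count seen by each layer."""
    return sum(layer_flops(n, d) for n in tokens_per_layer)


# Hypothetical settings, for illustration only (not taken from the paper).
DEPTH, DIM = 24, 1024     # ViT depth and hidden width
TOKENS = 4096             # patch tokens for one high-resolution image
SLICES = 4                # number of image slices
EARLY_LAYER = 4           # layers run before tokens are compressed
RATIO = 4                 # token reduction factor at the compression point

# Global encoding, post-ViT compression: every layer sees all tokens.
global_post = encoder_flops([TOKENS] * DEPTH, DIM)

# Slice-based encoding, post-ViT compression: attention stays within a slice.
per_slice = TOKENS // SLICES
slice_post = SLICES * encoder_flops([per_slice] * DEPTH, DIM)

# Slice-based encoding, early compression: later layers see fewer tokens.
schedule = [per_slice] * EARLY_LAYER + [per_slice // RATIO] * (DEPTH - EARLY_LAYER)
slice_early = SLICES * encoder_flops(schedule, DIM)

print(f"slice/post  vs global/post: {slice_post / global_post:.2f}x")
print(f"slice/early vs global/post: {slice_early / global_post:.2f}x")
```

Under this approximation, slicing only shrinks the quadratic attention term, since the projection and MLP terms scale linearly with the total token count. Early compression shrinks every term in the layers that run after the compression point, which is why moving the reduction into shallow layers lowers the encoder cost more than compressing after the final layer.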
Code & Models
- 🤗 openbmb/MiniCPM-V-4.6 · model · 222k dl · ♡ 906
- 🤗 openbmb/MiniCPM-V-4.6-gguf · model · 26k dl · ♡ 27
- 🤗 openbmb/MiniCPM-V-4.6-Thinking · model · 29k dl · ♡ 23
- 🤗 openbmb/MiniCPM-V-4.6-Thinking-gguf · model · 12k dl · ♡ 14
- 🤗 openbmb/MiniCPM-V-4.6-BNB · model · 1.3k dl · ♡ 6
- 🤗 heretic-org/MiniCPM-V-4.6-heretic · model · 158 dl · ♡ 2
- 🤗 ZENLLC/ZEN-MiniCPM-V-4.6 · model · 17 dl · ♡ 1
- 🤗 openbmb/MiniCPM-V-4.6-AWQ · model · 1.9k dl · ♡ 3
- 🤗 openbmb/MiniCPM-V-4.6-GPTQ · model · 1.4k dl · ♡ 3
- 🤗 openbmb/MiniCPM-V-4.6-Thinking-GPTQ · model · 1.0k dl · ♡ 4
