ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng

TL;DR
ConvLLaVA introduces a hierarchical visual encoder using ConvNeXt to efficiently process high-resolution images in large multimodal models, reducing redundancy and computational complexity while maintaining competitive performance.
Contribution
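It replaces the Vision Transformer (ViT) with ConvNeXt, a hierarchical backbone, as the visual encoder of an LMM, and applies two optimizations: updating the low-resolution pretrained ConvNeXt so that it works on high-resolution inputs, and training an additional stage that further compresses the visual tokens to reduce redundancy.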
Findings
Supports 1536x1536 resolution with only 576 visual tokens
Achieves competitive results on mainstream benchmarks
Reduces computational complexity compared to existing models
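The token figure above follows from simple arithmetic: a hierarchical encoder emits one visual token per cell of its final feature map, so the count is the square of resolution divided by the total downsampling ratio. The sketch below is illustrative only. The 64x ratio is inferred from the numbers quoted here (1536 / sqrt(576) = 64), the helper name `visual_tokens` is hypothetical, and the patch size of 14 at 336x336 is an assumed ViT setting used only for comparison.

```python
def visual_tokens(resolution: int, downsample: int) -> int:
    """Tokens for a square image: one token per cell of the final feature map."""
    side = resolution // downsample
    return side * side


# Assumption: 64x total downsampling, inferred from 1536x1536 -> 576 tokens.
CONV_DOWNSAMPLE = 64
# Assumption: a ViT with 14x14 patches, used here only as a point of comparison.
VIT_PATCH = 14

for res in (768, 1024, 1536):
    print(f"hierarchical encoder at {res}x{res}: "
          f"{visual_tokens(res, CONV_DOWNSAMPLE)} tokens")

print(f"ViT (patch 14) at 336x336: {visual_tokens(336, VIT_PATCH)} tokens")
print(f"ViT (patch 14) at 1536x1536: {visual_tokens(1536, VIT_PATCH)} tokens")
```

Under these assumptions, a 1536x1536 input yields the same number of tokens as a patch-14 ViT at 336x336, while the same ViT applied directly at 1536x1536 would produce over ten thousand tokens.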
Abstract
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's…
Code & Models
- 🤗 ConvLLaVA/ConvLLaVA-sft-768 · model · 4 dl · ♡ 1
- 🤗 ConvLLaVA/ConvLLaVA-sft-1024 · model · 2 dl
- 🤗 ConvLLaVA/ConvLLaVA-sft-1536 · model
- 🤗 ConvLLaVA/ConvLLaVA-pretrain-768 · model · 6 dl · ♡ 1
- 🤗 ConvLLaVA/ConvLLaVA-pretrain-1024 · model · 3 dl
- 🤗 ConvLLaVA/ConvLLaVA-pretrain-1536 · model · 10 dl · ♡ 2
- 🤗 ConvLLaVA/ConvLLaVA-ConvNeXt-768 · model · 17 dl
- 🤗 ConvLLaVA/ConvLLaVA-ConvNeXt-1024 · model · 4 dl
- 🤗 ConvLLaVA/ConvLLaVA-ConvNeXt-1536 · model · 19 dl · ♡ 1
- 🤗 toshi456/ConvLLaVA-JP-1.3b-768 · model · 9 dl · ♡ 2
Taxonomy
Topics: Speech and dialogue systems · Semantic Web and Ontologies
Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · ConvNeXt
