ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Chunjiang Ge; Sijie Cheng; Ziming Wang; Jiale Yuan; Yuan Gao; Jun Song; Shiji Song; Gao Huang; Bo Zheng

arXiv:2405.15738·cs.CV·May 27, 2024·1 cites

TL;DR

ConvLLaVA introduces a hierarchical visual encoder using ConvNeXt to efficiently process high-resolution images in large multimodal models, reducing redundancy and computational complexity while maintaining competitive performance.

Contribution

ConvLLaVA replaces the Vision Transformer with ConvNeXt as the visual encoder of an LMM, and adds optimizations so the encoder handles high-resolution images effectively while producing fewer redundant visual tokens.

Findings

01 Supports 1536x1536 resolution with only 576 visual tokens

02 Achieves competitive results on mainstream benchmarks

03 Reduces computational complexity compared to existing models
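
The token count in the first finding can be checked with simple arithmetic. The sketch below assumes the encoder emits one token per cell of a square feature grid; the overall stride of 64 is inferred from the stated numbers (1536 / 64 = 24, and 24 x 24 = 576) rather than quoted from this page, and the stride of 16 is only an illustrative comparison point, not a figure from the paper. The function name `visual_tokens` is hypothetical.

```python
def visual_tokens(image_size: int, stride: int) -> int:
    """Tokens produced for a square image when each side is reduced by
    `stride` and every cell of the resulting grid becomes one token."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return side * side


# Assumption: overall stride of 64, inferred from 1536x1536 -> 576 tokens.
compressed = visual_tokens(1536, 64)

# Assumption: stride of 16, used here only as an illustrative comparison.
uncompressed = visual_tokens(1536, 16)

# Ratio of token counts, and ratio of token pairs that self-attention
# in the language model would have to compare (quadratic in token count).
token_ratio = uncompressed // compressed
pair_ratio = token_ratio ** 2

print(compressed, uncompressed, token_ratio, pair_ratio)
```

Under these assumptions, the lower-stride grid yields 16 times as many tokens, which is why the abstract treats the number of visual tokens, not only the encoder's own cost, as the main driver of compute.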

Abstract

High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's…

Code & Models

Repositories

alibaba/conv-llava
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Semantic Web and Ontologies

Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · ConvNeXt