To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Junyan Lin; Haoran Chen; Dawei Zhu; Xiaoyu Shen

arXiv:2410.06765 · cs.CL · October 10, 2024

TL;DR

This study systematically evaluates how different connector types in multimodal large language models affect performance across various perception and reasoning tasks, providing guidance for architecture design.

Contribution

It introduces a unified classification of connectors and benchmarks their impact on diverse perception and reasoning tasks in MLLMs.

Findings

01

Feature-preserving connectors excel in fine-grained perception tasks.

02

Feature-compressing connectors offer speed advantages and perform well in coarse-grained perception and reasoning tasks.

03

These results offer guidance for choosing a connector when designing an MLLM architecture.

Abstract

In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate on constructing MLLM architectures, particularly regarding the selection of appropriate connectors for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance. Our findings reveal that feature-preserving connectors excel in fine-grained perception tasks due to their ability to retain detailed visual…
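The distinction between the two connector families can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the single linear projection, the 1D average-pooling window, and all sizes below are assumptions chosen only to show that a feature-preserving connector keeps one output token per visual patch, while a feature-compressing connector reduces the token count before the language model sees it.

```python
import random

def project(tokens, weight):
    """Apply a linear map to every token (hypothetical stand-in for a learned projection)."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]

def preserving_connector(patches, weight):
    """Feature-preserving sketch: project each patch, token count unchanged."""
    return project(patches, weight)

def compressing_connector(patches, weight, window):
    """Feature-compressing sketch: average-pool `window` consecutive patches, then project.

    Pooling over a 1D window is a simplification assumed here for brevity.
    """
    pooled = []
    for start in range(0, len(patches), window):
        group = patches[start:start + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return project(pooled, weight)

random.seed(0)
patches = [[random.random() for _ in range(4)] for _ in range(16)]  # 16 patch tokens, dim 4
weight = [[random.random() for _ in range(8)] for _ in range(4)]    # maps dim 4 to dim 8

preserved = preserving_connector(patches, weight)
compressed = compressing_connector(patches, weight, window=4)
print(len(preserved), len(compressed))  # prints "16 4"
```

In this sketch the compressed output is four times shorter, which is where a speed advantage would come from, while the preserved output keeps every patch token and therefore more spatial detail.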

Code & Models

Repositories

eit-nlp/connector-selection-for-mllm
Official

Taxonomy

Topics

Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods

Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings