Modality-Specialized Synergizers for Interleaved Vision-Language Generalists
Zhiyang Xu, Minqian Liu, Ying Shen, Joy Rimchala, Jiaxin Zhang, Qifan Wang, Yu Cheng, Lifu Huang

TL;DR
This paper introduces modality-specialized adaptation layers for vision-language models, significantly improving their ability to generate interleaved text and images by leveraging modality-specific features and a new instruction tuning dataset.
Contribution
The paper proposes MOSS, a novel modality-specialized synergizer with adaptation layers for VLGs, and LEAFINSTRUCT, a large interleaved instruction tuning dataset, enhancing interleaved generation capabilities.
Findings
Achieves state-of-the-art performance on complex interleaved tasks.
Demonstrates strong generalizability across different VLG architectures.
Significantly surpasses baseline models in interleaved text-image generation.
Abstract
Recent advancements in Vision-Language Models (VLMs) have led to the emergence of Vision-Language Generalists (VLGs) capable of understanding and generating both text and images. However, seamlessly generating an arbitrary sequence of text and images remains a challenging task for the current VLGs. One primary limitation lies in applying a unified architecture and the same set of parameters to simultaneously model discrete text tokens and continuous image features. Recent works attempt to tackle this fundamental problem by introducing modality-aware expert models. However, they employ identical architectures to process both text and images, disregarding the intrinsic inductive biases in these two modalities. In this work, we introduce MODALITY-SPECIALIZED SYNERGIZERS (MOSS), a novel design that efficiently optimizes existing unified architectures of VLGs with modality-specialized…
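The modality-specialized routing described above can be illustrated with a minimal sketch. It assumes, following the reviews' description, that text tokens receive a linear low-rank update while image tokens receive a convolutional low-rank update over their 2D patch grid. This is not the paper's Equation 4: the function names, the single shared kernel applied to every rank channel, the zero padding, and the row-major grid layout are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def linear_lora(x: Sequence[float], A: Matrix, B: Matrix, scale: float = 1.0) -> Vector:
    """Low-rank update for one token: scale * B (A x), with A (r x d) and B (d x r)."""
    return [scale * y for y in matvec(B, matvec(A, x))]


def conv_lora(grid: List[List[Vector]], A: Matrix, B: Matrix,
              kernel: Matrix, scale: float = 1.0) -> List[List[Vector]]:
    """Hypothetical convolutional low-rank update over an H x W grid of tokens.

    Each token is projected down with A, mixed with its spatial neighbours by a
    k x k kernel shared across rank channels (assumption), then projected up with B.
    """
    if not grid:
        return []
    h, w = len(grid), len(grid[0])
    rank, k = len(A), len(kernel)
    pad = k // 2
    low = [[matvec(A, tok) for tok in row] for row in grid]
    out = []
    for i in range(h):
        out_row = []
        for j in range(w):
            acc = [0.0] * rank
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding at the borders
                        for c in range(rank):
                            acc[c] += kernel[di][dj] * low[ii][jj][c]
            out_row.append([scale * y for y in matvec(B, acc)])
        out.append(out_row)
    return out


def modality_routed_update(tokens: List[Vector], is_image: List[bool],
                           grid_hw: Tuple[int, int],
                           text_ab: Tuple[Matrix, Matrix],
                           image_ab: Tuple[Matrix, Matrix],
                           kernel: Matrix) -> List[Vector]:
    """Route text tokens to linear_lora and image tokens to conv_lora.

    Assumes the image tokens, taken in sequence order, fill an h x w grid row by row.
    """
    h, w = grid_hw
    image_idx = [i for i, flag in enumerate(is_image) if flag]
    if len(image_idx) != h * w:
        raise ValueError("image token count must equal h * w")
    deltas = [[] if flag else linear_lora(tok, *text_ab)
              for tok, flag in zip(tokens, is_image)]
    grid = [[tokens[image_idx[r * w + c]] for c in range(w)] for r in range(h)]
    conv_out = conv_lora(grid, *image_ab, kernel)
    for r in range(h):
        for c in range(w):
            deltas[image_idx[r * w + c]] = conv_out[r][c]
    return deltas
```

In this sketch the two modalities use separate parameter sets, so the image branch can mix information from neighbouring patches while the text branch treats each token independently. With a kernel that is 1 at the centre and 0 elsewhere, the convolutional branch reduces to the linear one.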
Peer Reviews
Decision: ICLR 2025 Poster
1) The paper is well written, with clear figures and tables that make the story easy for readers to follow. 2) The idea of using modality-specific adapters to process text and images makes sense to me. Such adapters may capture the inherent semantics of the corresponding inputs. 3) The paper develops a high-quality interleaved instruction dataset, which will benefit the VLG community. 4) Experiments and ablations show the improvements and efficiency of the proposed modules.
1) One of the core concerns is the novelty of the two LoRA types. Given that neither linear LoRA nor convolutional LoRA is new to recent vision-language models, I think the contributions are limited. 2) Technically, MoSS assumes that a linear layer could lose visual information and adopts convolutional LoRA to capture local patch features. However, the ViT-based image encoder in EMU (ViTs are often used in VLMs) already flattens an image into a token sequence and employs self-
1. This paper proposes a novel design that enhances VLGs to generate interleaved content with modality-specialized parameters and adaptation architectures. 2. This paper introduces an open-source, large-scale instruction-tuning dataset that allows interleaved multi-image and text input and output.
1. The proposed convolutional LoRA (Equation 4) is similar to the LoRA proposed in [1]. The authors claim that their new LoRA could alleviate information loss, yet they provide no experimental comparison between the two kinds of convolutional LoRA. 2. The evaluation is limited. (a) The authors evaluate only on InterleavedBench and an image-editing benchmark called MagicBrush, so the coverage of the evaluation is relatively small. (b) Since the proposed model can do both image and text generation, more
1. The paper proposes a novel idea: parameters that process different modalities in VLGs should be trained with different strategies. 2. The proposed MoSS method brings promising improvements in VLG performance.
1. The performance improvement shown in Table 1 is inconsistent, with Chameleon displaying a decline in text quality after training with MoSS. 2. It remains unclear whether the observed performance enhancements are attributable to the MoSS training method or the LeafInstruct dataset (see Questions). 3. The parameters of ConvLoRA cannot be merged into the original parameters, as convolution is not linear. This limitation may lead to increased computational costs during inference.
Taxonomy
Topics: Speech and Audio Processing · Phonetics and Phonology Research · Hand Gesture Recognition Systems
Methods: OPT
