Spider: Any-to-Many Multimodal LLM
Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo

TL;DR
Spider is a framework that lets a multimodal LLM generate an arbitrary combination of modalities in a single response ('Text + Xs', e.g. text together with image, audio, and video), rather than text plus only one other modality.
Contribution
Introduces Spider, an efficient Any-to-Many Modalities Generation (AMMG) framework built from a Base Model, an Any-to-Many Instruction Template, and an Efficient Decoders-Controller, together with a new training dataset, enabling generation beyond pairwise 'Text + X' outputs.
Findings
Successfully generates arbitrary modality combinations 'Text + Xs'
Creates the first X-to-Xs many-modal dataset
Enhances multimodal interaction and future research potential
Abstract
Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel, efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, an Any-to-Many Instruction Template designed for producing Xs signal prompts, and a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents. To train Spider, we constructed a novel…
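To make the 'Text + Xs' idea concrete, the following is a minimal sketch of how a single response carrying several modality signal prompts might be split and routed to per-modality decoders. It is not the paper's implementation: the `<IMAGE>…</IMAGE>` tag format, the function names `split_response` and `route_to_decoders`, and the stub decoders are all assumptions made for illustration, since the actual Any-to-Many Instruction Template and Efficient Decoders-Controller are not specified in this summary.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical tag format, assumed for illustration only; the real
# Any-to-Many Instruction Template is not described in this summary.
SIGNAL_PATTERN = re.compile(r"<(IMAGE|AUDIO|VIDEO)>(.*?)</\1>", re.DOTALL)


def split_response(response: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Separate the plain text from the (modality, prompt) signal pairs."""
    signals = [
        (match.group(1).lower(), match.group(2).strip())
        for match in SIGNAL_PATTERN.finditer(response)
    ]
    # Drop the tagged spans, then collapse the leftover whitespace.
    text = " ".join(SIGNAL_PATTERN.sub("", response).split())
    return text, signals


def route_to_decoders(
    signals: List[Tuple[str, str]],
    decoders: Dict[str, Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Send each signal prompt to the decoder registered for its modality."""
    outputs = []
    for modality, prompt in signals:
        decoder = decoders.get(modality)
        if decoder is None:
            continue  # no decoder registered for this modality; skip it
        outputs.append((modality, decoder(prompt)))
    return outputs


def make_stub_decoder(name: str) -> Callable[[str], str]:
    """Stand-in for a real image/audio/video generator (assumption)."""
    return lambda prompt: f"[{name} generated from: {prompt}]"


DECODERS = {name: make_stub_decoder(name) for name in ("image", "audio", "video")}

if __name__ == "__main__":
    reply = (
        "Here is a beach scene. "
        "<IMAGE>a sunny beach</IMAGE> <AUDIO>waves crashing</AUDIO>"
    )
    text, signals = split_response(reply)
    print(text)
    for modality, content in route_to_decoders(signals, DECODERS):
        print(modality, content)
```

In this sketch a pairwise 'Text + X' response is simply the special case with one tagged span, while 'Text + Xs' corresponds to any number of tagged spans in the same reply.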
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
Methods: Balanced Selection
