Spider: Any-to-Many Multimodal LLM

Jinxiang Lai; Jie Zhang; Jun Liu; Jian Li; Xiaocheng Lu; Song Guo

arXiv:2411.09439 · cs.CV · April 8, 2025

TL;DR

Spider is a framework that lets a multimodal LLM generate an arbitrary combination of modalities in a single response ('Text + Xs', such as text with image, audio, and video together), rather than the pairwise 'Text + X' output of existing Any-to-Any models.

Contribution

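Introduces Spider, an efficient Any-to-Many Modalities Generation (AMMG) framework built from three components (a Base Model, an Any-to-Many Instruction Template, and an Efficient Decoders-Controller) together with a newly constructed training dataset, enabling multimodal generation beyond pairwise 'Text + X' outputs.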

Findings

01. Successfully generates arbitrary modality combinations 'Text + Xs'

02. Creates the first X-to-Xs many-modal dataset

03. Enhances multimodal interaction and future research potential

Abstract

Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, an Any-to-Many Instruction Template designed for producing Xs signal prompts, and a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents. To train Spider, we constructed a novel…
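
The difference between pairwise 'Text + X' and many-modal 'Text + Xs' output can be illustrated with a toy dispatcher. This is only a minimal sketch under stated assumptions, not Spider's implementation: the tag format (`<IMAGE>…</IMAGE>` and so on), the function names, and the stub decoders are all hypothetical and are not taken from the paper or its repository. It shows one text response carrying several modality signal prompts, each routed to its own decoder.

```python
import re
from typing import Callable, Dict, List, Tuple

# ASSUMPTION: a hypothetical tag format for modality signal prompts.
# This is NOT Spider's actual Any-to-Many Instruction Template.
SIGNAL_PATTERN = re.compile(r"<(IMAGE|AUDIO|VIDEO)>(.*?)</\1>", re.DOTALL)


def split_response(response: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Separate the plain text from the (modality, prompt) signal pairs."""
    signals = [
        (match.group(1).lower(), match.group(2).strip())
        for match in SIGNAL_PATTERN.finditer(response)
    ]
    # Drop the tagged spans and collapse leftover whitespace.
    text = " ".join(SIGNAL_PATTERN.sub("", response).split())
    return text, signals


def make_stub_decoder(modality: str) -> Callable[[str], str]:
    """Return a placeholder standing in for a real generative decoder."""
    def decode(prompt: str) -> str:
        return f"[{modality} generated from: {prompt}]"
    return decode


# Hypothetical decoder registry; a real system would hold generative models.
DECODERS: Dict[str, Callable[[str], str]] = {
    name: make_stub_decoder(name) for name in ("image", "audio", "video")
}


def generate_many(response: str) -> Dict[str, object]:
    """Route every signal prompt in one response to its matching decoder."""
    text, signals = split_response(response)
    outputs = [(modality, DECODERS[modality](prompt)) for modality, prompt in signals]
    return {"text": text, "outputs": outputs}


if __name__ == "__main__":
    demo = "Here is a beach scene. <IMAGE>a sunny beach</IMAGE> <AUDIO>waves crashing</AUDIO>"
    print(generate_many(demo))
```

A pairwise 'Text + X' response is simply the special case in which the response contains a single signal prompt; the many-modal case differs only in that the same response may carry any number of them.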

Code & Models

Repositories

Layjins/Spider (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Balanced Selection