SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge; Sijie Zhao; Jinguo Zhu; Yixiao Ge; Kun Yi; Lin Song; Chen Li; Xiaohan Ding; Ying Shan

arXiv:2404.14396 · cs.CV · March 4, 2025 · 1 cites

TL;DR

SEED-X is a versatile multimodal foundation model capable of understanding and generating images of various sizes and granularities, improving real-world applicability in vision-language tasks.

Contribution

The paper introduces SEED-X, a unified model that enhances multi-granularity image comprehension and generation, addressing limitations of previous models in real-world scenarios.

Findings

01. Achieves competitive results on public benchmarks.
02. Effectively handles images of arbitrary sizes and ratios.
03. Demonstrates strong performance in real-world applications.

Abstract

The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between their capability and real-world applicability, primarily due to the models' limited capacity to respond effectively to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap by integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across…

Code & Models

Repositories

ailab-cvc/seed-x
PyTorch · Official

Models

🤗 AILab-CVC/SEED-X-17B
model · ♡ 16

Datasets

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Focus