PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Rongyao Fang; Chengqi Duan; Kun Wang; Hao Li; Hao Tian; Xingyu Zeng; Rui Zhao; Jifeng Dai; Hongsheng Li; Xihui Liu

arXiv:2410.13861·cs.CV·October 22, 2024

TL;DR

PUMA introduces a unified multimodal large language model that effectively handles multi-granular visual generation tasks, from diverse text-to-image creation to precise image manipulation, by integrating multi-granular visual features.

Contribution

It proposes a novel framework that unifies multi-granular visual features as inputs and outputs within a single MLLM, addressing diverse visual task requirements.

Findings

01. Demonstrates proficiency across various multimodal tasks

02. Achieves effective multi-granular visual generation

03. Advances towards a truly unified MLLM framework

Abstract

Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm - from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 5

Strengths

* The idea of multiscale features in the image generation domain for targeting a diverse array of tasks is original and useful. Some features at some scale might benefit one task while other features might be beneficial for other tasks as shown in some of the experiments.
* The paper makes a compelling argument for the need of unifying generation and understanding and represents an attempt in this direction.
* The paper is well written and well organized in general.

Weaknesses

* The main concern is that the paper does not fulfill what seems to be its main promise: to have a single integrated model that can perform generative and understanding tasks. The models that are evaluated for each of the four tasks are different model checkpoints that are produced by targeting each downstream task specifically through different finetuning datasets.
* I also have some concern with not including FID (Frechet Inception Distance) as is customary to evaluate generative tasks. Whil
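For readers unfamiliar with the metric this review asks for, the standard definition of FID is given below. The symbol names are chosen here for illustration and are not taken from the paper or the review: $\mu_r, \Sigma_r$ denote the mean and covariance of Inception features of real images, and $\mu_g, \Sigma_g$ the same statistics for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one.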

Reviewer 02 · Rating 5 · Confidence 3

Strengths

Originality: To the best of my knowledge, this paper seems to be the first to combine multi-granularity image and text modeling. This makes it interesting and could spark some directions in the future.
Quality: The thorough evaluations presented in the experimental section are of fair quality, which helps the reviewers make a better assessment of the contributions and significance of the paper.
Clarity: The presentation of some of the ablations could use some work. Please see weaknesses secti

Weaknesses

* Understanding is negatively affected by multi-granularity modeling. Looking at Table 5 in the appendix, it is not clear whether multi-granularity modeling adds to the understanding ability of the model. For half of the metrics the performance becomes worse, for one it stays the same, and for only a single metric does the performance actually seem to improve. Do the authors have an intuition of why this is the case? Is spending part of the compute budget on the coarse-level features the reason for this?

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The idea of multi-granularity for image generation is interesting.
2. The visualization of multi-granularity images is clear and the performance looks good.
3. The paper is well written and easy to follow.

Weaknesses

1. The novelty of this paper seems limited; although it claims to introduce multi-granular visual generation, many papers (e.g., SEED-X[1], Matryoshka[2]) already focus on multi-granularity visual features, both in MLLM and image generation. Furthermore, the implementation of multi-granularity merely involves using different pooled visual features and different diffusion decoders, which also introduces additional parameters and computation.
2. In Table 1, is PUMA using (1+4+16+64+256) visual tok
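The token counts quoted in this review, (1+4+16+64+256), are what repeated 2x2 average pooling of a 16x16 feature grid would produce. The sketch below illustrates that arithmetic only. The grid size, the pooling operator, the feature dimension of 8, and the function name `pool_pyramid` are all assumptions made for illustration; this page does not state how PUMA actually derives its multi-granular features.

```python
import numpy as np


def pool_pyramid(grid: np.ndarray, num_levels: int = 5) -> list[np.ndarray]:
    """Hypothetical helper: build token sequences from finest to coarsest.

    grid is an (H, W, D) feature map with H == W a power of two
    (a 16x16 grid is assumed in the example below).
    """
    levels = []
    current = grid
    for _ in range(num_levels):
        h, w, d = current.shape
        # Flatten the spatial grid into a sequence of h*w tokens.
        levels.append(current.reshape(h * w, d))
        if h == 1:
            break
        # 2x2 average pooling: pair up rows and columns, then average each pair.
        current = current.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
    return levels


rng = np.random.default_rng(0)
features = rng.standard_normal((16, 16, 8))  # assumed grid and feature size
pyramid = pool_pyramid(features)
token_counts = [level.shape[0] for level in pyramid]
print(token_counts)       # [256, 64, 16, 4, 1]
print(sum(token_counts))  # 341
```

Under these assumptions the five scales together use 341 tokens per image, compared with 256 for the finest scale alone.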

Reviewer 04 · Rating 6 · Confidence 3

Strengths

The paper clearly states the problem, which is a common challenge for MLLMs that need to balance the different needs of multiple visual generation tasks. Solving such a challenge will definitely help improve the overall capability of MLLMs. The proposed solution is relatively simple (in a good way) and clean, and is in principle straightforward to implement. Experiments show that the proposed approach is also capable of solving the proposed problem, which helps both high diversity and high c

Weaknesses

The simple design will lead to a model that is slow at inference time: (a) the image feature sequence length is essentially doubled with a lot of redundant information, leading to a larger context window at inference time and 2x inference cost, as the output sequence length will be doubled for visual generation tasks; (b) several images have to be generated, each from a different diffusion decoder for the corresponding feature granularity. In the end, only one of these outputs will be used as the

Code & Models

Repositories

rongyaofang/puma
PyTorch · Official

Models

🤗 LucasFang/PUMA
model · ♡ 2

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Semantic Web and Ontologies