PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu

TL;DR
PUMA introduces a unified multimodal large language model that effectively handles multi-granular visual generation tasks, from diverse text-to-image creation to precise image manipulation, by integrating multi-granular visual features.
Contribution
It proposes a novel framework that unifies multi-granular visual features as inputs and outputs within a single MLLM, addressing diverse visual task requirements.
Findings
Demonstrates proficiency across various multimodal tasks
Achieves effective multi-granular visual generation
Advances towards a truly unified MLLM framework
Abstract
Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm - from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
* The idea of multiscale features in the image generation domain for targeting a diverse array of tasks is original and useful. Some features at some scale might benefit one task while other features might be beneficial for other tasks as shown in some of the experiments.
* The paper makes a compelling argument for the need of unifying generation and understanding and represents an attempt in this direction.
* The paper is well written and well organized in general.
* The main concern is that the paper does not fulfill what seems to be its main promise: to have a single integrated model that can perform generative and understanding tasks. The models evaluated for each of the four tasks are different model checkpoints, produced by targeting each downstream task specifically through different finetuning datasets.
* I also have some concern with not including FID (Frechet Inception Distance), as is customary to evaluate generative tasks. Whil
Originality: To the best of my knowledge, this paper seems to be the first to combine multi-granularity image and text modeling. This makes it interesting and could spark some directions in the future.
Quality: The thorough evaluations presented in the experimental section are of fair quality, which helps the reviewers make a better assessment of the contributions and significance of the paper.
Clarity: The presentation of some of the ablations could use some work. Please see weaknesses secti
* Understanding is negatively affected by multi-granularity modeling. Looking at Table 5 in the appendix, it is not clear whether multi-granularity modeling adds to the understanding ability of the model. For half of the metrics the performance becomes worse, for one it is the same, and only for a single metric does the performance actually seem to improve. Do the authors have an intuition of why this is the case? Is spending part of the compute budget on the coarse-level features the reason for this?
1. The idea of multi-granularity for image generation is interesting.
2. The visualization of multi-granularity images is clear and the performance looks good.
3. The paper is well written and easy to follow.
1. The novelty of this paper seems limited; although it claims to introduce multi-granular visual generation, many papers (e.g., SEED-X[1], Matryoshka[2]) already focus on multi-granularity visual features, both in MLLM and image generation. Furthermore, the implementation of multi-granularity merely involves using different pooled visual features and different diffusion decoders, which also introduces additional parameters and computation.
2. In Table 1, is PUMA using (1+4+16+64+256) visual tok
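The token counts quoted in the review above, (1+4+16+64+256), are consistent with repeatedly halving the side of a 16x16 feature grid. The following is a minimal sketch of that arithmetic, not the authors' implementation: the 16x16 starting grid, the 2x2 average pooling, the scalar features, and the function names `avg_pool_2x2` and `granularity_pyramid` are all assumptions made for illustration.

```python
def avg_pool_2x2(grid):
    """Average-pool a square grid (list of rows of floats), 2x2 window, stride 2."""
    n = len(grid)
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, n, 2)
        ]
        for r in range(0, n, 2)
    ]


def granularity_pyramid(grid):
    """Return grids from finest to coarsest, halving the side until it is 1x1."""
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(avg_pool_2x2(levels[-1]))
    return levels


# Hypothetical 16x16 grid of scalar "features"; real visual features would be vectors.
side = 16
finest = [[float(r * side + c) for c in range(side)] for r in range(side)]

levels = granularity_pyramid(finest)
token_counts = [len(g) * len(g) for g in levels]  # tokens per granularity level
total_tokens = sum(token_counts)                  # tokens if all levels are kept

print(token_counts, total_tokens)
```

Under these assumptions, keeping every level costs 341 tokens per image, against 256 for the finest level alone.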
The paper clearly states the problem, which is a common challenge for MLLMs that need to balance the different needs of multiple visual generation tasks. Solving such a challenge will definitely help improve the overall capability of MLLMs. The proposed solution is relatively simple (in a good way) and clean, and is in principle straightforward to implement. Experiments show that the proposed approach is also capable of solving the proposed problem, which helps both high diversity and high c
The simple design will lead to a model that is slow at inference time: a) The image feature sequence length is essentially doubled with a lot of redundant information, leading to a larger context window at inference time and 2x inference cost, as the output sequence length will be doubled for visual generation tasks. b) Several images have to be generated, each from a different diffusion decoder for the corresponding feature granularity. In the end, only one of these outputs will be used as the
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Semantic Web and Ontologies
