The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
Yuwei Sun, Yuxuan Yao, Hui Li, Siyu Zhu

TL;DR
This paper introduces a recursive, sparse mixture-of-experts framework integrated into diffusion models to improve structured reasoning and image generation quality in multimodal tasks.
Contribution
It presents a novel recursive, sparse mixture-of-experts approach with dynamic module selection within diffusion models for enhanced multimodal reasoning.
Findings
Outperforms existing models on ImageNet class-conditioned image generation.
Demonstrates improved performance on GenEval and DPG benchmarks.
Efficiently refines visual tokens over multiple latent steps.
Abstract
Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning, such as text-following tasks, remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding, extending these to multimodal text-to-image generation is challenging because visual tokens are continuous rather than discrete. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is…
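To make the idea of recursive refinement with sparse module selection concrete, here is a minimal illustrative sketch in pure Python. It is not the authors' implementation: the abstract is truncated before it says what the gating network does, so the top-k routing, the softmax weighting, the tanh expert modules, the residual update, and every name below (`sparse_moe_step`, `recursive_refine`, `top_k`, `num_steps`) are assumptions chosen for illustration. The joint attention layers are omitted entirely; only the loop that reuses one shared pool of modules across several latent steps is shown.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian weights; stands in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sparse_moe_step(token, experts, gate, top_k):
    """One refinement step for one token (assumed top-k routing)."""
    logits = matvec(gate, token)  # one gate score per expert
    order = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]  # only these experts are evaluated
    weights = softmax([logits[i] for i in chosen])
    update = [0.0] * len(token)
    for weight, idx in zip(weights, chosen):
        out = [math.tanh(v) for v in matvec(experts[idx], token)]
        update = [u + weight * o for u, o in zip(update, out)]
    # Residual update keeps the token close to its previous state.
    return [t + u for t, u in zip(token, update)], chosen


def recursive_refine(tokens, experts, gate, num_steps=3, top_k=2):
    """Apply the same expert pool repeatedly (parameters shared across steps)."""
    trace = []
    for _ in range(num_steps):
        results = [sparse_moe_step(t, experts, gate, top_k) for t in tokens]
        tokens = [refined for refined, _ in results]
        trace.append([chosen for _, chosen in results])
    return tokens, trace


rng = random.Random(0)
dim, num_experts = 4, 4
experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
gate = make_matrix(num_experts, dim, rng)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
refined, trace = recursive_refine(tokens, experts, gate, num_steps=3, top_k=2)
```

In this sketch, `trace` records which modules each token selected at each step, so the routing can change from one step to the next while the parameter count stays fixed, since the same `experts` list is reused at every step.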
