The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

Yuwei Sun; Yuxuan Yao; Hui Li; Siyu Zhu

arXiv:2604.25299 · cs.CV · April 29, 2026

TL;DR

This paper introduces a recursive, sparse mixture-of-experts framework integrated into diffusion models to improve structured reasoning and image generation quality in multimodal tasks.

Contribution

The paper presents a recursive, sparse mixture-of-experts approach with dynamic module selection inside diffusion models, aimed at stronger multimodal reasoning.

Findings

01. Outperforms existing models on ImageNet class-conditioned image generation.

02. Demonstrates improved performance on GenEval and DPG benchmarks.

03. Efficiently refines visual tokens over multiple latent steps.

Abstract

Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning, such as text-following tasks, remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding, extending these to multimodal text-to-image generation is challenging because visual tokens are continuous rather than discrete. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is…
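
The abstract is cut off before it describes the gating network, so the following is only a minimal sketch of the general idea it names: a shared pool of modules, sparsely selected per token, applied recursively over several latent steps. Everything specific here is an assumption made for illustration and is not taken from the paper: the top-k routing rule, the linear experts with a tanh nonlinearity, the residual update, and the names `recursive_sparse_refine`, `gate_w`, and `expert_ws`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recursive_sparse_refine(tokens, gate_w, expert_ws, num_steps=3, top_k=2):
    """Refine continuous tokens over several steps with a shared expert pool.

    tokens:    (n_tokens, dim) continuous visual tokens
    gate_w:    (dim, n_experts) hypothetical linear gating weights
    expert_ws: (n_experts, dim, dim) hypothetical per-expert linear maps
    Returns the refined tokens and the expert indices chosen at each step.
    """
    n_experts = expert_ws.shape[0]
    routes = []
    for _ in range(num_steps):
        # Gating: score every expert for every token, keep the top_k (assumed rule).
        logits = tokens @ gate_w                              # (n_tokens, n_experts)
        top_idx = np.argsort(logits, axis=-1)[:, -top_k:]     # (n_tokens, top_k)
        top_logits = np.take_along_axis(logits, top_idx, axis=-1)
        weights = softmax(top_logits, axis=-1)                # renormalize over selected experts

        update = np.zeros_like(tokens)
        for e in range(n_experts):
            mask = top_idx == e                               # where expert e was selected
            rows = mask.any(axis=-1)
            if not rows.any():
                continue                                      # expert unused this step
            w = (weights * mask).sum(axis=-1)[rows]           # gate weight per routed token
            update[rows] += w[:, None] * np.tanh(tokens[rows] @ expert_ws[e])

        # Residual update; the same experts are reused at every step (parameter sharing).
        tokens = tokens + update
        routes.append(top_idx)
    return tokens, routes

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
gate_w = rng.normal(size=(8, 4))
expert_ws = rng.normal(size=(4, 8, 8)) * 0.1
refined, routes = recursive_sparse_refine(tokens, gate_w, expert_ws)
print(refined.shape, len(routes))
```

Because the gate is re-evaluated on the updated tokens at every step, a token can be routed to different modules as it is refined, while the parameter count stays fixed regardless of the number of steps.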
