Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen

TL;DR
Jodi introduces a diffusion-based joint model that unifies visual generation and understanding, enabling simultaneous image and label generation, conditioned image creation, and multi-label prediction, supported by a new large-scale dataset.
Contribution
The paper presents Jodi, a novel diffusion transformer framework that unifies visual generation and understanding tasks, and introduces the large-scale Joint-1.6M dataset for training and evaluation.
Findings
Jodi outperforms existing models in both generation and understanding tasks.
Jodi demonstrates strong extensibility across various visual domains.
The Joint-1.6M dataset enables comprehensive training and evaluation.
Abstract
Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from…
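The role switch idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the role names (`generate`, `condition`, `ignore`), the domain names, and both helper functions are hypothetical, chosen only to show how one per-domain role assignment can express the three task types. The rule that at least one domain is always generated is also an assumption of this sketch.

```python
import random

# Hypothetical role names; the paper's actual naming may differ.
ROLES = ("generate", "condition", "ignore")


def sample_roles(domains, rng):
    """Randomly assign one role per domain, as might be done per training sample.

    Assumption of this sketch: at least one domain must be generated,
    otherwise there would be nothing to denoise.
    """
    roles = {d: rng.choice(ROLES) for d in domains}
    if "generate" not in roles.values():
        roles[rng.choice(list(domains))] = "generate"
    return roles


def task_roles(task, label_domains, given=()):
    """Fixed role assignments for the three task types named in the abstract."""
    if task == "joint_generation":
        # Image and all labels are generated together.
        return {"image": "generate", **{d: "generate" for d in label_domains}}
    if task == "controllable_generation":
        # Image is generated from whichever labels are supplied.
        labels = {d: ("condition" if d in given else "ignore") for d in label_domains}
        return {"image": "generate", **labels}
    if task == "image_perception":
        # Image is the condition; all labels are predicted at once.
        return {"image": "condition", **{d: "generate" for d in label_domains}}
    raise ValueError(f"unknown task: {task}")


labels = ["depth", "normal", "albedo", "edge"]
print(sample_roles(["image"] + labels, random.Random(0)))
print(task_roles("controllable_generation", labels, given=("depth", "edge")))
```

Under these assumptions, each inference task is just one fixed point in the space of role assignments that the random sampler visits during training.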
Peer Reviews
Decision·Submitted to ICLR 2026
* The role switch mechanism: by randomly switching each domain between being generated, used as conditioning, or ignored, the model learns all the key distributions for both image generation and perception at once, giving one network the flexibility to do many things.
* Paper is well motivated, and generally well written and easy to follow.
* The proposed method uses shared positional embeddings across domains (+ small role tags) so that the model knows which pixels align spatially, making
* “Extensibility to more domains” is claimed, but it’s unclear how much effort is needed for an entirely new modality.
* Baselines for multimodal generation/understanding might not share the same supervision or label availability. This needs to be clarified in the paper.
* Randomly switching roles during training might make optimization harder or convergence slower, especially as domains increase.
* Even with linear attention, jointly modeling 8+ domains could still be memory-intensive and
1. Principled and Elegant Framework: The core idea of unifying $p(x|y)$ and $p(y|x)$ by modeling the joint distribution $p(x, y)$ is statistically elegant. The "Role-Switch" mechanism is a clever and direct implementation of this principle, forcing the model to learn a wide range of conditional and marginal distributions within one architecture.
2. The authors made smart architectural choices. The use of a linear diffusion transformer (Sana) correctly identifies and solves the $\mathcal{O}(M^2)$
The model's "unification" comes at a significant cost to specialist performance. Although it performs better than omni-models on edge detection and normal estimation, Jodi performs noticeably worse than SOTA specialist models on albedo estimation and depth estimation. For example, in depth estimation (Table 2), Jodi achieves 10.1 AbsRel on NYUv2, while the specialist Lotus-D achieves 5.1. In albedo estimation (Table 4), Jodi gets 15.5 PSNR, while the specialist RGB2X gets 20.6. The model excels
I like this paper due to the following strengths: (1) Flexible Control: The framework naturally supports complex, multi-modal conditioning, offering unparalleled flexibility for creative applications. (2) Enhanced Generalization: By forcing the model to simultaneously learn the generative process and the analytical structure, Jodi is likely to develop a richer, more robust latent representation of visual concepts.
(1) Lack of T2I evaluation. As demonstrated in Fig. 11, the generated image is the bridge between text and other conditions. Therefore, I would like to see some comparison with T2I methods on GenEval or other T2I benchmarks. (2) The proposed method achieves condition-to-image generation by modeling the joint distribution of various conditions. However, some generation tasks (e.g., depth to image [a]) could be evaluated in a more precise manner. The authors should compare with, or at least discuss, such re
Taxonomy
Methods: Diffusion
