LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer
Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

TL;DR
LaTtE-Flow is a unified multimodal transformer architecture for image understanding and generation. It achieves faster inference while keeping strong task performance by distributing the flow-matching process across specialized groups of layers.
Contribution
The paper introduces LaTtE-Flow, a novel architecture that combines pretrained VLMs with a layerwise flow-based design for efficient multimodal image understanding and generation.
Findings
Achieves around 6x faster inference than recent unified multimodal models (a reviewer cites Janus Pro as the comparison).
Maintains strong performance on multimodal understanding tasks.
Provides competitive image generation quality.
Abstract
Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a…
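The layerwise timestep-expert idea can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the contiguous equal-sized grouping, the direction of the timestep-to-group mapping, and the scalar "layers" are all assumptions made for illustration. It only shows why running one group of layers per sampling step costs fewer layer calls than running the full stack at every step.

```python
from typing import Callable, List, Sequence, Tuple

# A toy "layer": takes a hidden value and a timestep, returns a new hidden value.
Layer = Callable[[float, float], float]


def assign_layer_groups(num_layers: int, num_groups: int) -> List[range]:
    """Split layer indices into contiguous, near-equal groups (assumed scheme)."""
    base, extra = divmod(num_layers, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def group_for_timestep(t: float, num_groups: int) -> int:
    """Map t in [0, 1] to the index of the group treated as its expert."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return min(int(t * num_groups), num_groups - 1)


def euler_sample(
    x: float, layers: Sequence[Layer], num_groups: int, num_steps: int
) -> Tuple[float, int]:
    """Euler integration of a flow where each step runs only one layer group.

    Returns the final sample and the total number of layer calls made.
    """
    groups = assign_layer_groups(len(layers), num_groups)
    dt = 1.0 / num_steps
    calls = 0
    for step in range(num_steps):
        t = step * dt
        h = x
        # Only the expert group for this timestep is evaluated.
        for i in groups[group_for_timestep(t, num_groups)]:
            h = layers[i](h, t)
            calls += 1
        x = x + dt * h  # toy choice: the group's output is used as the velocity
    return x, calls


if __name__ == "__main__":
    toy_layers = [lambda h, t: 1.0 for _ in range(8)]  # constant velocity field
    sample, layer_calls = euler_sample(0.0, toy_layers, num_groups=4, num_steps=4)
    print(sample, layer_calls)
```

In this toy setting, 8 layers split into 4 groups with 4 sampling steps use 8 layer calls, whereas running all 8 layers at each of the 4 steps would use 32. The real model's speedup depends on details not covered in this summary.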
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is well written and easy to follow. The sampling efficiency of visual generation is a significant research question. 2. The solution of distributing timestep modelling across transformer layers is intuitive. And the proposed time-conditioned residual attention effectively incorporates cross-layer information, boosting convergence and overall performance. 3. Comprehensive studies on the design choices, such as expert groups and the effects of residual attention, are conducted in th
1. *Unclear Motivation:* As stated in the abstract, the paper studies unified multimodal models that struggle to match the performance of specialist models. However, the paper only addresses the problem of sampling efficiency, which seems to digress from the core issue of unified models. 2. *Experiments Are Not Comprehensive:* Although the paper targets unified multimodal models that include both text and image, the image generation of the model is only trained and
1. The paper presents a clear and well-motivated problem statement, effectively highlighting the efficiency–quality trade-off in unified multimodal generation and offering a logically coherent solution through a flow-based Transformer design. 2. The proposed Layerwise Timestep-Expert mechanism is both elegant and practical, significantly improving inference efficiency by activating only relevant Transformer layers at each timestep. 3. The integration of Timestep-Conditioned Residual Attention
1. The paper does not provide a direct comparison between the proposed LaTtE-Flow and the original VLM backbone on multimodal understanding tasks, leaving unclear how much the unified training or flow-based adaptation affects understanding performance. 2. The work lacks quantitative results on standard text-to-image generation benchmarks, which limits the evaluation of LaTtE-Flow’s true generative capability and generalization to open-ended visual synthesis. 3. Although the architecture introd
1. **The idea is interesting.** The idea of decoupling multiple flow-matching steps into multiple transformer blocks is interesting and results in good performance. 2. **Efficient.** LaTtE-Flow is very efficient, 6 times faster than Janus Pro. The authors provide real running times to verify their claim. 3. **Effective.** LaTtE-Flow achieves low latency while keeping strong image understanding and generation performance.
1. **Compare and discuss with concurrent works.** Although they use different data and model sizes, I suggest the authors compare against and discuss newer unified MLLMs, including LMFusion, Blip3o and Bagel. 2. **Unification of generation and understanding.** LaTtE-Flow uses different visual encoders and different sets of parameters for image understanding and generation. If the model first generates an image and then performs VQA based on the generated image, it requires two forward passes. I hope the
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
