JetFormer: An Autoregressive Generative Model of Raw Images and Text
Michael Tschannen, André Susano Pinto, Alexander Kolesnikov

TL;DR
JetFormer is a unified autoregressive transformer model that directly generates and understands both images and text from raw data, eliminating the need for separate pretrained components and achieving competitive quality.
Contribution
It introduces JetFormer, a novel decoder-only transformer trained end-to-end on raw images and text, integrating a normalizing flow for image representation without relying on pretrained autoencoders.
Findings
Achieves text-to-image generation quality comparable to VQ-VAE and VAE baselines.
Demonstrates strong image understanding capabilities.
First model capable of high-fidelity image generation with strong log-likelihood bounds.
Abstract
Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation…
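As a rough illustration of how a normalizing flow lets a model maximize the likelihood of raw data, the sketch below applies the change-of-variables formula, log p(x) = log p_z(f(x)) + log|det J_f(x)|, using a single affine coupling layer. Everything in it is an assumption made for illustration: `toy_conditioner` is a hypothetical stand-in for a learned network, the input is a short list of floats rather than image patches, and a standard normal density stands in for the autoregressive transformer's density over soft tokens. It is not the paper's architecture.

```python
import math


def toy_conditioner(x_a, n_out):
    """Hypothetical stand-in for a learned network: maps x_a to (log_scale, shift)."""
    s = sum(x_a) / len(x_a)
    log_scale = [0.5 * math.tanh(s)] * n_out  # bounded log-scale for stability
    shift = [0.1 * s] * n_out
    return log_scale, shift


def coupling_forward(x):
    """Affine coupling: pass the first half through, transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = toy_conditioner(x_a, len(x_b))
    z_b = [xb * math.exp(ls) + t for xb, ls, t in zip(x_b, log_scale, shift)]
    log_det = sum(log_scale)  # Jacobian is triangular, so log|det| is this sum
    return x_a + z_b, log_det


def coupling_inverse(z):
    """Exact inverse of coupling_forward (this is what makes the flow a decoder)."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_scale, shift = toy_conditioner(z_a, len(z_b))
    x_b = [(zb - t) * math.exp(-ls) for zb, ls, t in zip(z_b, log_scale, shift)]
    return z_a + x_b


def std_normal_logpdf(z):
    """Log-density of a standard normal; placeholder for the transformer's density."""
    return sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in z)


def log_likelihood(x):
    """Change of variables: log p(x) = log p_z(f(x)) + log|det J_f(x)|."""
    z, log_det = coupling_forward(x)
    return std_normal_logpdf(z) + log_det


x = [0.3, -1.2, 0.8, 2.0]
z, log_det = coupling_forward(x)
x_rec = coupling_inverse(z)
```

Because the same invertible map is used in both directions, the forward pass plays the role of an encoder and the inverse plays the role of a decoder, which mirrors the dual role the abstract describes for the flow.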
Peer Reviews
Decision: ICLR 2025 Poster
1. JetFormer processes text and images within a single autoregressive transformer, removing the need for modality-specific encoders and supporting seamless multimodal generation tasks.
2. JetFormer’s fully end-to-end training from raw data enables it to potentially learn task-specific representations, enhancing adaptability without relying on pre-trained embeddings.
1. The claim that pre-trained modality-specific encoders (e.g., VQ-VAE) limit performance due to task-agnostic design lacks sufficient quantitative or qualitative evidence to substantiate this limitation convincingly.
2. While end-to-end training in JetFormer may offer the advantage of learning task-specific latent representations, this approach could also introduce challenges in terms of training stability and computational cost. The paper does not provide a comparative analysis between end-to-
Originality: The paper introduces innovative techniques to enhance the quality of image generation models, particularly through the use of a novel noise curriculum and the factoring out of latent dimensions. The approach of factoring out redundant dimensions using a PCA-inspired variant before applying a flow model is a creative solution to improve model efficiency and performance. This originality is further demonstrated by the introduction of classifier-free guidance during sampling, which add
Model Performance and Design Choices: The paper highlights several design choices that impact model performance, such as the use of dropout and PCA transforms. It notes that modeling images after PCA leads to worse results and that omitting noise curriculum results in significantly poorer outcomes. These observations suggest that the model's performance is sensitive to specific preprocessing and design choices, which may limit its robustness and generalizability.
Factoring Dimensions: The paper
1. The paper is well-written and easy to follow.
2. The proposed method is innovative, showing a promising approach towards end-to-end image-text joint probabilistic modeling. It shows that it is possible to directly optimize NLL for potentially any modality.
3. The paper identifies two training techniques that help the model converge better and are crucial for the method's effectiveness.
4. The experiments are sufficient. It also shows the proposed method can potentially bene
1. In Figure 3 of the paper, it seems that the noise curriculum makes the final NLL slightly higher than without the noise curriculum, but experimentally, the model clearly benefits from the noise curriculum training technique. Does this indicate that NLL may not be the best objective for image-text generation tasks? A more theoretical explanation would help readers understand this phenomenon.
2. The detailed architecture of the normalizing flow model is not clearly explained
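The noise curriculum that several reviews refer to can be pictured as Gaussian noise added to training images, with a standard deviation that shrinks to zero as training progresses. The sketch below is an assumption-laden illustration rather than the paper's implementation: the cosine decay shape, the starting value `sigma_start=64.0` (on a 0 to 255 pixel scale), and the helper names `noise_std` and `add_curriculum_noise` are all hypothetical choices.

```python
import math
import random


def noise_std(step, total_steps, sigma_start=64.0, sigma_end=0.0):
    """Cosine decay of the noise level from sigma_start to sigma_end (assumed shape)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return sigma_end + (sigma_start - sigma_end) * cosine


def add_curriculum_noise(pixels, step, total_steps, rng):
    """Add zero-mean Gaussian noise whose scale depends on training progress."""
    sigma = noise_std(step, total_steps)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


rng = random.Random(0)
pixels = [12.0, 128.0, 255.0]
early = add_curriculum_noise(pixels, step=0, total_steps=100, rng=rng)   # heavy noise
late = add_curriculum_noise(pixels, step=100, total_steps=100, rng=rng)  # no noise
```

Under such a schedule the model would see heavily blurred inputs early on and clean images by the end, so the NLL measured during training is not directly comparable across noise levels, which is relevant to the reviewer's question about Figure 3.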
Taxonomy
Topics: Image Processing and 3D Reconstruction
