Dolfin: Diffusion Layout Transformers without Autoencoder
Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, Zhuowen Tu

TL;DR
Dolfin is a novel Transformer-based diffusion model for layout generation that improves modeling capability and reduces complexity, with applications in geometric structure modeling and enhanced performance on standard benchmarks.
Contribution
Introduces Dolfin, a diffusion layout transformer without autoencoder, featuring an autoregressive variant for capturing semantic object correlations, and demonstrates superior benchmark performance.
Findings
Significantly improves layout generation metrics
Effectively models geometric structures like line segments
Enhances transparency and interoperability
Abstract
In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves the modeling capability with reduced complexity compared to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we further propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations between neighboring objects, such as alignment, size, and overlap. When evaluated against standard generative layout benchmarks, Dolfin notably improves performance across various metrics (FID, alignment, overlap, MaxIoU, and DocSim scores), enhancing transparency and interoperability in the process. Moreover, Dolfin's applications extend beyond layout generation, making it…
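To make "diffusion without an autoencoder" concrete, the sketch below applies the standard DDPM forward (noising) step directly to layout tokens, that is, to box coordinates plus a one-hot class label, with no latent encoder in between. It is an illustration under stated assumptions, not the authors' implementation: the (x, y, w, h) box format, the linear beta schedule, the rescaling to [-1, 1], and every function name here are hypothetical choices made for this example.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed linear noise schedule; the paper's schedule may differ.
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    # alpha_bar_t = product over s <= t of (1 - beta_s)
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def layout_to_tokens(boxes, labels, num_classes):
    # One token per object: 4 box values in [0, 1] followed by a one-hot
    # class vector, all rescaled to [-1, 1]. No autoencoder is involved.
    tokens = []
    for box, label in zip(boxes, labels):
        one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
        tokens.append([2.0 * v - 1.0 for v in list(box) + one_hot])
    return tokens


def q_sample(x0, t, alpha_bars, rng):
    # Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    a_bar = alpha_bars[t]
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = [
        [math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * e
         for v, e in zip(row, eps)]
        for row, eps in zip(x0, noise)
    ]
    return x_t, noise


alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = layout_to_tokens(
    [(0.1, 0.1, 0.8, 0.2), (0.1, 0.4, 0.8, 0.5)],  # hypothetical boxes
    [0, 1],                                         # hypothetical class ids
    num_classes=3,
)
x_t, noise = q_sample(x0, 500, alpha_bars, random.Random(0))
```

In a full model, a Transformer would take the noised tokens `x_t` and the timestep and be trained to predict `noise`; that denoising network is omitted here.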
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper is detailed and easy to follow. Additional experiments on line segment generation can be useful to consider along with the other tasks.
The paper offers potential value to the community. However, concerns regarding its novelty and the robustness of its experimental evaluations need to be addressed for it to be ready for publication. Novelty: The core proposition of the paper, which involves the utilization of the input coordinate space for layout design generation through continuous diffusion models, is not entirely novel. Similar approaches have been discussed in prior works such as [1, 2]. Experiments and Comparison: The exp
- not requiring the autoencoder layer in the diffusion model
- autoregressive diffusion model to capture the rich semantic correlation between objects/items
- experiment on generating geometric structures beyond layout, such as line segments
- not using an autoencoder is not a new idea; the Imagen model already operates directly on pixels
- there is no intuition for why the autoregressive design leads to better semantic correlation, although this is observed in the experiments
- few baseline comparisons for line segment generation
1. This paper is clearly written and easy to follow.
2. The proposed models notably improve quantitative results against generative layout benchmarks.
1. The main difference from previous models is that Dolfin operates directly on the input space of layouts (the coordinates and corresponding class labels) instead of processing the layouts with a VAE or dedicated modules. However, the reasons for the resulting performance gains are not sufficiently justified.
2. "enhancing transparency and interoperability" is overclaimed, since it is a property of the standard diffusion process itself.
3. From the paper presentation it is not clear what are the modificat
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
