Dolfin: Diffusion Layout Transformers without Autoencoder

Yilin Wang; Zeyuan Chen; Liangjun Zhong; Zheng Ding; Zhizhou Sha; Zhuowen Tu

arXiv:2310.16305·cs.CV·October 26, 2023·1 cites


Open Access · 3 Reviews

TL;DR

Dolfin is a novel Transformer-based diffusion model for layout generation that improves modeling capability and reduces complexity, with applications in geometric structure modeling and enhanced performance on standard benchmarks.

Contribution

Introduces Dolfin, a diffusion layout transformer without autoencoder, featuring an autoregressive variant for capturing semantic object correlations, and demonstrates superior benchmark performance.

Findings

01 Significantly improves layout generation metrics

02 Effectively models geometric structures like line segments

03 Enhances transparency and interoperability

Abstract

In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves the modeling capability with reduced complexity compared to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we further propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations between neighboring objects, such as alignment, size, and overlap. When evaluated against standard generative layout benchmarks, Dolfin notably improves performance across various metrics (FID, alignment, overlap, MaxIoU, and DocSim scores), enhancing transparency and interoperability in the process. Moreover, Dolfin's applications extend beyond layout generation, making it…
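The abstract describes running diffusion on the layout itself rather than on an autoencoder latent. As a rough illustration of what that can look like, the sketch below applies the standard DDPM forward (noising) step to a raw layout tensor. Everything specific here is an assumption, not taken from the paper: the token layout (`x, y, w, h` plus a one-hot class vector), the linear beta schedule, and the helper names `make_alpha_bar`, `encode_layout`, and `q_sample` are all hypothetical.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def encode_layout(boxes, labels, num_classes):
    """Pack a layout into one continuous tensor, with no autoencoder involved.

    boxes:  (N, 4) array of (x, y, w, h), each normalised to [0, 1].
    labels: (N,) integer class ids.
    Returns an (N, 4 + num_classes) array rescaled to [-1, 1].
    This token format is an illustrative assumption.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    one_hot = np.eye(num_classes)[np.asarray(labels)]
    tokens = np.concatenate([boxes, one_hot], axis=1)
    return 2.0 * tokens - 1.0


def q_sample(x0, t, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)

# Two boxes (say, a title and a paragraph) with class ids 0 and 1 out of 3.
x0 = encode_layout(
    boxes=[[0.1, 0.1, 0.8, 0.2], [0.1, 0.4, 0.8, 0.5]],
    labels=[0, 1],
    num_classes=3,
)
x_t, noise = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a setup like this, a Transformer denoiser would be trained to predict `noise` from `x_t` and `t`, treating each row as one sequence token; that part is omitted because the page gives no architectural details.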

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The paper is detailed and easy to follow. Additional experiments on line segment generation can be useful to consider along with the other tasks.

Weaknesses

The paper offers potential value to the community. However, concerns regarding its novelty and the robustness of its experimental evaluations need to be addressed for it to be ready for publication.

Novelty: The core proposition of the paper, which involves the utilization of the input coordinate space for layout design generation through continuous diffusion models, is not entirely novel. Similar approaches have been discussed in prior works such as [1, 2].

Experiments and Comparison: The exp

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- not requiring the autoencoder layer in the diffusion model
- autoregressive diffusion model to capture the rich semantic correlation between objects/items
- experiment on generating geometric structures beyond layout, such as line segments

Weaknesses

- not using an autoencoder is not a new idea; the Imagen model processes pixels directly
- there is no intuition on why the autoregressive design leads to better semantic correlation, although this is observed in the experiments
- not many baseline comparisons for the line segment generation

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

1. This paper is clearly written and easy to follow.
2. The proposed models notably improve quantitative results against generative layout benchmarks.

Weaknesses

1. The main difference from previous models is that this one operates directly on the input space of layouts (the coordinates and corresponding class labels) instead of processing the layouts with a VAE or dedicated modules. However, the reasons for the resulting performance gains are not sufficiently justified.
2. "enhancing transparency and interoperability" is overclaimed, since it is a property of the standard diffusion process itself.
3. From the paper presentation it is not clear what are the modificat


Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion