Content-Aware Ad Banner Layout Generation with Two-Stage Chain-of-Thought in Vision Language Models
Kei Yoshitake, Kento Hosono, Ken Kobayashi, Kazuhide Nakata

TL;DR
This paper introduces a two-stage method in which a vision-language model first analyzes the background image and writes a text placement plan, then renders that plan as HTML, producing content-aware advertisement layouts.
Contribution
The paper presents a new two-stage pipeline leveraging vision-language models for content-aware ad banner layout generation, improving over traditional saliency-based methods.
Findings
Outperforms existing layout methods in quality metrics
Produces more semantically coherent advertisement layouts
Effectively integrates image content analysis into layout design
Abstract
In this paper, we propose a method for generating layouts for image-based advertisements by leveraging a Vision-Language Model (VLM). Conventional advertisement layout techniques have predominantly relied on saliency mapping to detect salient regions within a background image, but such approaches often fail to fully account for the image's detailed composition and semantic content. To overcome this limitation, our method harnesses a VLM to recognize the products and other elements depicted in the background and to inform the placement of text and logos. The proposed layout-generation pipeline consists of two steps. In the first step, the VLM analyzes the image to identify object types and their spatial relationships, then produces a text-based "placement plan" based on this analysis. In the second step, that plan is rendered into the final layout by generating HTML-format code. We…
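The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `generate_layout` function, the `stub_vlm` stand-in, the canvas-size parameters, and the choice to pass the image again in the second step are all assumptions made for the example. A real system would replace `stub_vlm` with a call to an actual vision-language model.

```python
from typing import Callable, List, Optional

# A VLM is modeled as any callable taking (prompt, image_path) and returning text.
# This interface is hypothetical; it is not an API named by the paper.
VLM = Callable[[str, Optional[str]], str]

# Hypothetical prompt templates; the paper's actual prompts are not given here.
PLAN_PROMPT = (
    "Describe the objects in this image and their spatial relationships, "
    "then write a placement plan for these elements: {elements}."
)
HTML_PROMPT = (
    "Convert this placement plan into HTML with one absolutely positioned "
    "<div> per element on a {width}x{height} canvas:\n{plan}"
)


def generate_layout(
    vlm: VLM, image_path: str, elements: List[str], width: int, height: int
) -> str:
    """Run the two stages in sequence and return the HTML layout text."""
    # Stage 1: image analysis produces a text-based placement plan.
    plan = vlm(PLAN_PROMPT.format(elements=", ".join(elements)), image_path)
    # Stage 2: the plan is rendered as HTML-format code.
    # Passing the image again here is an assumption of this sketch.
    return vlm(HTML_PROMPT.format(width=width, height=height, plan=plan), image_path)


def stub_vlm(prompt: str, image_path: Optional[str]) -> str:
    """Canned responses standing in for a real model, for illustration only."""
    if prompt.startswith("Describe"):
        return "Product is centered; put the logo top-left and the text along the bottom."
    return '<div class="logo" style="position:absolute;left:10px;top:10px"></div>'


if __name__ == "__main__":
    print(generate_layout(stub_vlm, "banner.png", ["logo", "text"], 600, 400))
```

Keeping the plan as an intermediate text artifact means it can be inspected or edited before the HTML is generated, which is one plausible reason to split the task into two prompts rather than one.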
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
