CTR-Driven Advertising Image Generation with Multimodal Large Language Models

Xingye Chen; Wei Feng; Zhenbang Du; Weizhen Wang; Yanyin Chen; Haohan Wang; Linkai Liu; Yaoyu Li; Jinyuan Zhao; Yu Li; Zheng Zhang; Jingjing Lv; Junjie Shen; Zhangang Lin; Jingping Shao; Yuanjie Shao; Xinge You; Changxin Gao; Nong Sang

arXiv:2502.06823 · cs.LG · February 13, 2025

TL;DR

This paper introduces a novel approach using Multimodal Large Language Models optimized with reinforcement learning to generate advertising images that maximize click-through rates, outperforming existing methods in relevance and effectiveness.

Contribution

The authors develop a CTR-optimized image generation framework with a new reward model and a product-centric strategy, advancing the use of MLLMs in advertising.
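
The summary does not say how click preferences become a training signal for the reward model. One common approach, sketched here purely as an assumption, mines preference pairs from impression logs. Creatives of the same product are compared by observed CTR, and the higher-CTR image is marked as preferred. The log format, the `min_gap` threshold, and the function names are hypothetical. The per-product grouping is only an illustration and is not necessarily the paper's product-centric strategy.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical log records: (product_id, image_id, impressions, clicks). Made-up data.
LOGS = [
    ("p1", "img_a", 1000, 30),
    ("p1", "img_b", 1000, 45),
    ("p1", "img_c", 800, 16),
    ("p2", "img_d", 500, 10),
    ("p2", "img_e", 500, 10),
]

def ctr(impressions, clicks):
    """Observed click-through rate; 0.0 when there are no impressions."""
    return clicks / impressions if impressions else 0.0

def build_preference_pairs(logs, min_gap=0.005):
    """Group creatives by product and emit (product, preferred, rejected) triples.

    Only images of the same product are compared, so product popularity does
    not leak into the preference. Pairs whose CTR gap is below `min_gap`
    (a hypothetical threshold) are skipped as too noisy.
    """
    by_product = defaultdict(list)
    for product_id, image_id, imps, clicks in logs:
        by_product[product_id].append((image_id, ctr(imps, clicks)))

    pairs = []
    for product_id, images in by_product.items():
        for (img1, c1), (img2, c2) in combinations(images, 2):
            if abs(c1 - c2) < min_gap:
                continue
            winner, loser = (img1, img2) if c1 > c2 else (img2, img1)
            pairs.append((product_id, winner, loser))
    return pairs

if __name__ == "__main__":
    for pair in build_preference_pairs(LOGS):
        print(pair)
```

Comparing only within a product acts as a simple control. Without it, an image of a best-selling product would "beat" images of niche products regardless of its visual quality.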

Findings

01. Achieves state-of-the-art CTR performance in both online and offline tests.
02. Effectively aligns generated images with product features.
03. Demonstrates the benefit of reinforcement learning in multimodal image generation.
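
The toy sketch below makes the role of reinforcement learning concrete. It assumes a reward model has already assigned a predicted CTR to each of a few candidate background descriptions for one product. A softmax policy is then pushed toward higher-reward choices by exact policy-gradient ascent, using dE[r]/dz_j = p_j (r_j - E[r]). The candidates, reward values, and learning rate are invented, and this summary does not specify the paper's actual RL algorithm.

```python
import math

# Hypothetical candidate background descriptions a policy could emit for one product.
CANDIDATES = ["plain studio", "kitchen counter", "outdoor picnic"]
# Made-up predicted CTRs standing in for a learned CTR reward model.
REWARDS = [0.020, 0.035, 0.025]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(probs, rewards):
    return sum(p * r for p, r in zip(probs, rewards))

def policy_gradient(rewards, lr=10.0, steps=3000):
    """Exact policy-gradient ascent on a softmax policy over a toy bandit.

    For a softmax policy, dE[r]/dz_j = p_j * (r_j - E[r]); E[r] acts as the
    baseline. RL fine-tuning of a real MLLM would instead estimate this
    gradient from sampled generations scored by the reward model.
    """
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = expected_reward(probs, rewards)
        logits = [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

if __name__ == "__main__":
    final = policy_gradient(REWARDS)
    for name, p in zip(CANDIDATES, final):
        print(f"{name:>16}: {p:.3f}")
```

The policy concentrates on the candidate with the highest predicted CTR. This is the behavior that CTR-driven fine-tuning aims for, here reduced to three fixed options.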

Abstract

In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods that generate backgrounds for products focus primarily on aesthetic quality, which may not yield satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images, optimizing for Click-Through Rate (CTR) as the primary objective. First, we build targeted pre-training tasks and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation. To further improve the CTR of generated images, we propose a novel reward model for fine-tuning the pre-trained MLLMs through Reinforcement Learning (RL); this reward model jointly utilizes multimodal features and accurately reflects user click preferences.…
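
The abstract describes a reward model that jointly uses multimodal features and reflects user click preferences, but it gives no architecture or loss. As a hedged illustration only, the sketch below trains a toy reward on preference pairs with a Bradley-Terry (pairwise logistic) loss, -log σ(r_preferred - r_rejected). Both that loss and the diagonal bilinear scorer w · (image ⊙ text) are assumptions. The bilinear scorer lets the product/text features gate which image features count. The 3-dimensional feature vectors and all data are made up, and whatever model the paper actually uses is not reproduced here.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reward(weights, image_feat, text_feat):
    """Toy multimodal reward: diagonal bilinear form w . (image * text)."""
    return sum(w * i * t for w, i, t in zip(weights, image_feat, text_feat))

def pairwise_loss(weights, pairs):
    """Mean Bradley-Terry loss, -log sigmoid(r_preferred - r_rejected)."""
    total = 0.0
    for text, img_win, img_lose in pairs:
        margin = reward(weights, img_win, text) - reward(weights, img_lose, text)
        total += softplus(-margin)  # equals -log(sigmoid(margin))
    return total / len(pairs)

def train(pairs, dim, lr=0.5, epochs=200):
    """Full-batch gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for text, img_win, img_lose in pairs:
            # d(margin)/dw_k = (win_k - lose_k) * text_k
            diff = [(a - b) * t for a, b, t in zip(img_win, img_lose, text)]
            margin = sum(w * d for w, d in zip(weights, diff))
            # d/dw of softplus(-margin) is -sigmoid(-margin) * diff
            coeff = -sigmoid(-margin)
            for k in range(dim):
                grad[k] += coeff * diff[k]
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights

# Made-up (text_feat, preferred_image_feat, rejected_image_feat) triples.
PAIRS = [
    ([1.0, 0.0, 1.0], [0.9, 0.1, 0.8], [0.2, 0.7, 0.1]),
    ([0.0, 1.0, 1.0], [0.1, 0.9, 0.7], [0.8, 0.2, 0.3]),
    ([1.0, 1.0, 0.0], [0.7, 0.8, 0.2], [0.3, 0.1, 0.9]),
]

if __name__ == "__main__":
    w = train(PAIRS, dim=3)
    print("loss before:", pairwise_loss([0.0, 0.0, 0.0], PAIRS))
    print("loss after:", pairwise_loss(w, PAIRS))
```

A purely linear scorer over concatenated [image, text] features would not work here. In a pairwise margin for the same product, the text term cancels, so the multiplicative interaction is what makes the score depend on both modalities.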

Code & Models

Repositories

chenguoz/caig (PyTorch, official)

Taxonomy

Topics: Topic Modeling

Methods: Softmax · Attention Is All You Need · Focus