Efficient LLM-based Advertising via Model Compression and Parallel Verification
Wenxin Dong, Chang Gao, Guanghui Yu, Xuewu Jiao, Mingqing Hu, Qiang Fu, Peng Xu, Penghui Wei, Hui Xu, Yue Xing, Shuanglong Li, Lin Liu

TL;DR
This paper proposes a framework for deploying large language models (LLMs) in real-time advertising systems. It combines model compression (quantization and sparsification) with parallel verification to reduce inference latency and computational cost.
Contribution
An Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference in advertising.
Findings
The framework achieves a significant inference speedup on LLM-based advertising tasks.
Generation quality degrades only to a degree the authors consider acceptable after compression and acceleration.
Results come from experiments on two real-world advertising scenarios.
Abstract
Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.
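The abstract names group quantization as one of the framework's compression components but does not describe it further. As a rough illustration of the general idea only, the following is a minimal sketch of symmetric group-wise weight quantization, in which each group of weights gets its own scale so that a large weight in one group does not reduce precision in another. The function names, the fixed `group_size`, and the 4-bit width are assumptions made for this example; the paper's adaptive group selection is not reproduced here.

```python
from typing import List, Tuple


def quantize_groups(
    weights: List[float], group_size: int = 4, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric per-group quantization (illustrative, not the paper's method).

    Each group of `group_size` weights shares one floating-point scale.
    """
    qmax = 2 ** (bits - 1) - 1  # largest signed code, e.g. 7 for 4 bits
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # An all-zero group would give scale 0; fall back to 1.0 in that case.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            code = round(w / scale)
            codes.append(max(-qmax, min(qmax, code)))  # clamp to code range
    return codes, scales


def dequantize_groups(
    codes: List[int], scales: List[float], group_size: int = 4
) -> List[float]:
    """Map integer codes back to approximate weights using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical weights: the second group has a much larger dynamic range.
weights = [0.1, -0.2, 0.05, 0.4, 8.0, -6.0, 2.0, 1.0]
codes, scales = quantize_groups(weights, group_size=4, bits=4)
restored = dequantize_groups(codes, scales, group_size=4)
```

With a single scale shared by all eight weights, the small values in the first group would round to zero or to one coarse step. With per-group scales, the reconstruction error for each weight stays within half of its own group's step size.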
