# SUMMA: A Multimodal Large Language Model for Advertisement Summarization

**Authors:** Weitao Jia, Shuo Yin, Zhoufutu Wen, Han Wang, Zehui Dai, Kun Zhang, Zhenyu Li, Tao Zeng, Xiaohui Lv

arXiv: 2508.20582 · 2025-10-13

## TL;DR

SUMMA is a multimodal large language model designed to generate concise, commercially valuable summaries of video ads, improving ad comprehension, relevance ranking, and increasing advertising revenue on short video platforms.

## Contribution

The paper introduces SUMMA, a novel multimodal model that combines supervised fine-tuning and reinforcement learning to produce explainable ad summaries from video, frames, and transcripts, enhancing advertising systems.

## Key findings

- Online experiments show a 1.5% increase in advertising revenue.
- SUMMA effectively condenses multimodal ad content into valuable summaries.
- Integration of SUMMA improves candidate retrieval and relevance ranking.

## Abstract

Understanding multimodal video ads is crucial for improving query-ad matching and relevance ranking on short video platforms, enhancing advertising effectiveness and user experience. However, the effective utilization of multimodal information with high commercial value still largely constrained by reliance on highly compressed video embeddings-has long been inadequate. To address this, we propose SUMMA (the abbreviation of Summarizing MultiModal Ads), a multimodal model that automatically processes video ads into summaries highlighting the content of highest commercial value, thus improving their comprehension and ranking in Douyin search-advertising systems. SUMMA is developed via a two-stage training strategy-multimodal supervised fine-tuning followed by reinforcement learning with a mixed reward mechanism-on domain-specific data containing video frames and ASR/OCR transcripts, generating commercially valuable and explainable summaries. We integrate SUMMA-generated summaries into our production pipeline, directly enhancing the candidate retrieval and relevance ranking stages in real search-advertising systems. Both offline and online experiments show substantial improvements over baselines, with online results indicating a statistically significant 1.5% increase in advertising revenue. Our work establishes a novel paradigm for condensing multimodal information into representative texts, effectively aligning visual ad content with user query intent in retrieval and recommendation scenarios.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20582/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20582/full.md

## References

75 references — full list in the complete paper: https://tomesphere.com/paper/2508.20582/full.md

---
Source: https://tomesphere.com/paper/2508.20582