SUMMA: A Multimodal Large Language Model for Advertisement Summarization

Weitao Jia; Shuo Yin; Zhoufutu Wen; Han Wang; Zehui Dai; Kun Zhang; Zhenyu Li; Tao Zeng; Xiaohui Lv

arXiv:2508.20582·cs.IR·October 13, 2025

TL;DR

SUMMA is a multimodal large language model designed to generate concise, commercially valuable summaries of video ads, improving ad comprehension, relevance ranking, and increasing advertising revenue on short video platforms.

Contribution

The paper introduces SUMMA, a novel multimodal model that combines supervised fine-tuning and reinforcement learning to produce explainable ad summaries from video, frames, and transcripts, enhancing advertising systems.

Findings

01

Online experiments show a 1.5% increase in advertising revenue.

02

SUMMA effectively condenses multimodal ad content into valuable summaries.

03

Integration of SUMMA improves candidate retrieval and relevance ranking.

Abstract

Understanding multimodal video ads is crucial for improving query-ad matching and relevance ranking on short video platforms, enhancing advertising effectiveness and user experience. However, the effective utilization of multimodal information with high commercial value, still largely constrained by reliance on highly compressed video embeddings, has long been inadequate. To address this, we propose SUMMA (short for Summarizing MultiModal Ads), a multimodal model that automatically processes video ads into summaries highlighting the content of highest commercial value, thus improving their comprehension and ranking in Douyin search-advertising systems. SUMMA is developed via a two-stage training strategy (multimodal supervised fine-tuning followed by reinforcement learning with a mixed reward mechanism) on domain-specific data containing video frames and ASR/OCR transcripts,…

Tables

Table 1. Statistics of the three constructed datasets (AdSum-Doubao for SFT, AdSum-Human for RL, AdSum-Test for evaluation).

Dataset Name	Annotated	Size
AdSum-Doubao	LLM	400K
AdSum-Human	Human	100K
AdSum-Test	Human	2.5K

Table 2. The performances of different models on the AdSum-Test across various evaluation metrics. The best result for each metric is boldfaced.

Method	BLEU	CIDEr	ROUGE	RewardSem
Qwen2-VL-2B-Instruct	0.10	0.156	0.26	0.09
Qwen2-VL-7B-Instruct	0.14	0.18	0.29	0.12
SUMMA-SFT	0.36	1.01	0.46	0.26
Single-Stage SFT	0.38	1.05	0.48	0.28
Two-Stage SFT	0.41	1.12	0.52	0.30
SUMMA-RL	0.48	1.42	0.62	0.38

Table 3. The synergistic effects between different modalities. Multimodal (video + OCR/ASR) input outperforms either single modality.

Method	BLEU	CIDEr	ROUGE	RewardSem
Unimodal-OCR/ASR	0.27	0.64	0.42	0.17
Unimodal-Video	0.34	0.82	0.48	0.21
Multimodal	0.48	1.42	0.62	0.38

Table 4. Performance of SUMMA-RL on the AdSum-Test set, which we split into two subsets according to ASR/OCR information density: Sparse vs. Rich.

Test Set	BLEU	CIDEr	ROUGE	RewardSem
Sparse ASR/OCR	0.28	0.65	0.46	0.19
Rich ASR/OCR	0.52	1.53	0.69	0.45
AdSum-Test	0.48	1.42	0.62	0.38

Table 5. Ablation study of reward design: lexical-only, semantic-only, and mixed (RewardLex + RewardSem).

Method	BLEU	CIDEr	ROUGE	RewardSem
RewardLex	0.44	1.24	0.57	0.31
RewardSem	0.38	1.11	0.49	0.37
RewardLex + RewardSem	0.48	1.42	0.62	0.38

Table 6. Performance comparison between our SUMMA and the embedding method for the downstream retrieval task.

Method	Diversity Ratio	Granularity Ratio	Hit@10	Hit@100
Multimodal Embedding	7.22%	1.22%	4.6%	18%
SUMMA	10.23%	3.22%	6.5%	22%

Table 7. Performance comparison between our SUMMA and the embedding method for the downstream relevance ranking task.

Method	AUC	Recall@Precision90
Multimodal Embedding	0.90	0.65
SUMMA	0.96	0.71

Table 8. Online A/B-test gains of SUMMA: irrelevant-case ratio (↓), conversion ratio (↑), and ad revenue (↑).

Task	Irrelevant Ratio Δ (↓)	Conversion Ratio Δ (↑)	Ad Revenue Δ (↑)
Retrieval	-1.0%	+0.9%	+0.5%
Relevance Ranking	-4.0%	+1.5%	+1.0%

Equations

$$\mathcal{L}_{\text{Domain-FT}} = -\mathbb{E}_{(\mathbf{x_{v}},\mathbf{x_{\text{aux}}},\mathbf{y})\sim\mathcal{D}_{\text{SFT}}}\left[\log\pi(\mathbf{y}\mid\mathbf{x_{v}},\mathbf{x_{\text{aux}}};\bm{\theta})\right],$$

$\mathcal{L}_{\text{GRPO}}$: an expectation over $\bm{q}$ and $\bm{o}^{(i)}$ (truncated in the source), with terms $\hat{A}^{(i)}$, $\rho^{(i)}$, and $D_{\text{KL}}$.

$$\text{LP}(\bm{o},\bm{y}) = \min\left(1,\frac{|\bm{y}|}{|\bm{o}|}\right)^{\gamma},$$

$$R(\bm{o},\bm{y}) = \text{BLEU}(\bm{o},\bm{y}) + \text{LP}(\bm{o},\bm{y})\cdot\text{LLM\_Score}(\bm{o},\bm{y})$$

Taxonomy

Topics: Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining · Web Data Mining and Analysis

Full text

SUMMA: A Multimodal Large Language Model for Advertisement Summarization

Weitao Jia, ByteDance SearchAds, Beijing, China

Shuo Yin, ByteDance SearchAds, Beijing, China

Zhoufutu Wen, ByteDance, Beijing, China

Han Wang, ByteDance, Beijing, China

Zehui Dai, ByteDance SearchAds, Beijing, China

Kun Zhang, ByteDance SearchAds, Beijing, China

Zhenyu Li, ByteDance SearchAds, Beijing, China

Tao Zeng, ByteDance SearchAds, Beijing, China

Xiaohui Lv, ByteDance SearchAds, Beijing, China

(2025)

Abstract.

Understanding multimodal video ads is crucial for improving query-ad matching and relevance ranking on short video platforms, enhancing advertising effectiveness and user experience. However, the effective utilization of multimodal information with high commercial value, still largely constrained by reliance on highly compressed video embeddings, has long been inadequate. To address this, we propose SUMMA (short for SUmmarizing MultiModal Ads), a multimodal model that automatically processes video ads into summaries highlighting the content of highest commercial value, thus improving their comprehension and ranking in our search-advertising systems. SUMMA is developed via a two-stage training strategy (multimodal supervised fine-tuning followed by reinforcement learning with a mixed reward mechanism) on domain-specific data containing video frames and ASR/OCR transcripts, generating commercially valuable and explainable summaries. We integrate SUMMA-generated summaries into our production pipeline, directly enhancing the candidate retrieval and relevance ranking stages in real search-advertising systems. Both offline and online experiments show substantial improvements over baselines, with online results indicating a statistically significant 1.5% increase in advertising revenue. Our work establishes a novel paradigm for condensing multimodal information into representative texts, effectively aligning visual ad content with user query intent in retrieval and recommendation scenarios.

Keywords: Multimodal Large Language Models, Search Advertising, Information Retrieval Systems, Advertisement Summarization

Published in: Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. Journal year: 2025. Copyright: ACM licensed. DOI: 10.1145/3746252.3761393. ISBN: 979-8-4007-2040-6/2025/11.

CCS Concepts: Information systems → Information systems applications; Information systems → Information retrieval; Computing methodologies → Natural language processing

1. Introduction

Short video platforms (e.g., Douyin, TikTok, Instagram Reels, and Kuaishou) have significantly changed how billions of people consume digital content every day. In this environment, search advertising becomes essential because video ads need to match user search intent accurately. Understanding multimodal content to identify valuable commercial information in ads is critical for effective advertising. However, current methods face challenges, especially in clearly defining what information has high commercial value (such as advertised products, brands, and key selling points) and in efficiently extracting this information from videos.

Vision–text contrastive learning has surged in recent years, catalysing multimodal fusion research and greatly enhancing neural models’ capacity to make predictions over real-world visual–textual information. Prior studies (Radford et al., 2021; Jia et al., 2021; Li et al., 2022b, 2021) show that large-scale, generic vision–language pre-training delivers strong cross-modal alignment, yet this paradigm largely ignores the domain-specific semantics of business-oriented scenarios such as search advertising. Follow-up works (Gan et al., 2025; Dong et al., 2024; Ye et al., 2023; Wen et al., 2024; Yang et al., 2024; Wen et al., 2023) adapt these foundations to composed-image search, video–text retrieval and ad ranking, but their lack of interpretability and reliance on coarse global embeddings cause them to miss fine-grained visual-textual correlations, degrading commercial reliability. Recently, significant progress has been achieved in Large Language Models (LLMs) (Ouyang et al., 2022; Touvron et al., 2023; Qwen et al., 2025; Yang et al., 2025; GLM et al., 2024; Cai et al., 2024; DeepSeek-AI, 2024), driving the substantial development of sophisticated Multimodal LLMs (MLLMs) (OpenAI, 2024a; AI, 2024; Li et al., 2023; Liu et al., 2023b; Zhu et al., 2023; Bai et al., 2025; Chen et al., 2024b; Wu et al., 2024; Wang et al., 2025b, 2024d; Nie et al., [n. d.]; Wang et al., 2024c), where modality alignment training among distinct modules leads to superior performance across diverse multimodal tasks, such as visual question answering (Liu et al., 2024a; Chen et al., 2024a), document understanding (OCR) (Mathew et al., 2021; Masry et al., 2022; Wang et al., [n. d.]), and some vision-centric tasks (Li et al., 2022c; Paiss et al., 2023). Despite their remarkable success, MLLMs remain underexplored in retrieval and recommendation systems.

In this paper, we propose SUMMA (SUmmarizing MultiModal Ads), a specialized MLLM for search advertising applications. We first perform supervised fine-tuning (SFT) from Qwen2-VL-2B-Instruct on an extensive collection of advertising short video summarization data (sourced from our short video platform). To ensure the quality of our SFT data, we implement a rule-based filtering approach leveraging LLM-as-a-judge. This methodology employs dual criteria: assessing the intrinsic quality of video summaries and evaluating their potential to enhance performance in downstream relevance tasks. Taking advertisement videos concatenated with their corresponding OCR/ASR texts as input, the resulting SUMMA-SFT model exhibits a strong basic capacity for summarizing multimodal advertisements. Subsequently, we further enhance SUMMA-SFT through reinforcement learning on high-quality, meticulously annotated data. Our RL reward design draws inspiration from (Feng et al., 2025a) and (Chang et al., 2025): the model receives encouragement or penalty signals based on reference-based metrics (e.g., BLEU and external LLM judgment) that compare the generated outputs with human-annotated ground truth responses.

The resulting model, called SUMMA-RL, demonstrates a significant improvement over baselines by generating concise yet informative summaries that efficiently process multimodal ad content into text of high commercial value, improving interpretability and staying friendly to latency-sensitive downstream advertising tasks. Although reminiscent of image/video captioning, our “visual ad summarization” task uniquely produces representative textual content tailored to search-system ads, using a professional advertising lexicon to spotlight core semantic elements such as brands, selling points, and target audiences.

Extensive experimentation shows that combining video frames with OCR/ASR transcripts markedly outperforms any single-modality approach, and the margin widens as the textual stream becomes richer. Incorporating a mixed reward mechanism (joint lexical and semantic rewards) further steers the model toward summaries that are simultaneously precise and meaning-preserving. These higher-quality summaries, in turn, yield clear downstream gains by raising retrieval hit rates and improving ranking AUC. Overall, the key contributions of our work are as follows:

• We propose a data construction pipeline for the task of ad video summarization, where the generated video summary carries high commercial value by distilling key persuasive elements.

• We propose an efficient two-stage training paradigm, i.e., multimodal supervised fine-tuning followed by reinforcement learning, and experiments demonstrate that this scheme yields the best summarization quality and downstream performance.

• We establish a search advertising framework that takes SUMMA as the core multimodal comprehension component. The generated ad summaries enhance the downstream tasks (retrieval and relevance ranking) by improving query-ad alignment. Unlike methods relying solely on multimodal embeddings, our approach better captures video semantics and key patterns, ultimately improving overall search advertising performance.

• Both offline evaluation and online deployment in production environments consistently demonstrate the efficacy of the proposed SUMMA and its associated advertising architecture. The online experimental results demonstrate that SUMMA yields a significant 1.5% increase in advertising revenue in total. Human evaluation further reveals a 5% reduction in the relevance bad-case rate.

2. Related Work

2.1. Multimodal Approaches for Retrieval and Relevance

Initial research on retrieval and recommendation systems mainly concentrated on unimodal textual analysis, which limited performance (Chang et al., 2021; Liu et al., 2021; Yao et al., 2021; Zou et al., 2021). The evolution of cross-modal understanding has since been significantly propelled by vision-language pretraining frameworks, such as the seminal contrastive-learning-based CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and DeCLIP (Li et al., 2022b), which bridge the semantic gap between visual and textual domains through large-scale corpus training. These models demonstrate remarkable transfer learning capabilities across diverse multimodal tasks. To align different modalities, Xu et al. propose ARMMT (Xu et al., 2024), which applies cross-attention to image and text embeddings and outputs highly fused multimodal representations. Similarly, based on cross-attention, Gan et al. (Gan et al., 2025) design a pseudo-query-video matching task to further improve the modality alignment ability of their two-tower model. To better process video information, Dong et al. (Dong et al., 2024) leverage temporal modeling modules (STAN (Liu et al., 2023a)) in their zero-shot video-text retrieval pipeline and achieve notable effectiveness. To stay consistent with human preference, Guo et al. (Guo et al., 2024) utilize Proximal Policy Optimization (PPO (Schulman et al., 2017)) to discern partial order relations among labels in the multimodal label relevance ranking task. Recently, Zhang et al. (Zhang et al., 2025) have proposed NoteLLM-2, which uses a multimodal understanding LLM to embed vision-text-mixed content into compressed representations. In contrast, SUMMA aligns with the pretraining pattern of MLLMs, generating high-quality text that the downstream retrieval and relevance ranking modules can leverage directly and naturally.

Considering the latency requirements for online services, the deployed downstream models need to be small-scale, and their inputs should also have limited length. However, directly using multimodal ad embeddings and user query text as inputs would impose an additional burden of multimodal fusion on such downstream task modules, which are inherently less capable due to their limited parameter counts. In contrast, the textual outputs from SUMMA can be seamlessly integrated with downstream text-based similarity/relevance computation tasks, effectively alleviating the training burden and knowledge memorization pressure on subsequent single-modal modules. Consequently, SUMMA allows downstream modules to focus exclusively on optimizing the alignment between user intent and advertising content.

2.2. Multimodal Large Language Models

Recent years have witnessed a surge in attention towards MLLMs, with their cross-modal understanding capabilities becoming a focal point in artificial intelligence research. Pioneering works like BLIP (Li et al., 2022a), Flamingo (Alayrac et al., 2022) and BLIP-2 (Li et al., 2023) establish foundational architectures by integrating cross-modal attention mechanisms, demonstrating exceptional performance across diverse multimodal benchmarks. Following this trend, the community has focused on enhancing vision-language alignment through instruction training, exemplified by LLaVA (Liu et al., 2023b), MiniGPT-4 (Zhu et al., 2023) and InstructBLIP (Dai et al., 2023), which leverage large-scale image-text-mixed annotated/synthesized data to refine visual dialogue abilities. Among them, MiniGPT-4 and LLaVA have demonstrated that even simple modality fusion modules (like a linear projector) can achieve effective cross-modal alignment. Building upon this foundation, state-of-the-art frameworks such as LLaVA-NeXT (Liu et al., 2024b), Qwen-VL series (Bai et al., 2023; Wang et al., 2024a; Bai et al., 2025), Intern-VL series (Chen et al., 2024d, c, b), and DeepSeek-VL series (Lu et al., 2024; Wu et al., 2024) support higher-resolution visual inputs for fine-grained visual cognition, via integrating enhanced visual encoders, image partitioning strategies, or sophisticated token compression algorithms. Moreover, these works also establish diverse and effective training paradigms to handle different intricate multimodal scenarios. Recently, a growing number of studies have focused on MLLMs for video understanding tasks (Zhang et al., 2023; Li et al., 2024; Maaz et al., 2024; Wang et al., 2022a, 2024b, 2025a, b, 2024c; Nie et al., [n. d.]; Wang et al., 2024d), with particular emphasis on enhancing their spatiotemporal representation capabilities. 
Despite this great success, research on MLLMs remains underdeveloped in the domain of search advertising, so model capacities in this scenario stay constrained and fall short of their full potential.

Our proposed SUMMA, designed for video ad understanding, has undergone domain-specific fine-tuning and reinforcement learning on large-scale ad summarization datasets, where the OCR/ASR textual features provide a supplementary signal that improves overall performance. This specialized training pipeline enables effective deployment in search advertising downstream services such as retrieval and relevance ranking.

2.3. Reinforcement Learning for MLLMs

The recent OpenAI-o1 (OpenAI, 2024b) and DeepSeek-R1 (DeepSeek-AI, 2025) have catalyzed growing interest within the open-source community in enhancing LLM reasoning capabilities through reinforcement learning (RL) training. They have likewise stimulated research on applying RL to MLLMs. Among these efforts, Visual-RFT (Liu et al., 2025) outperforms conventional visual instruction tuning methods in computer vision tasks via a reward mechanism based on IoU or classification correctness. Through Group Relative Policy Optimization (GRPO (Shao et al., 2024)), Visual-RFT demonstrates enhanced capabilities in object detection, visual grounding, and image recognition while requiring substantially less training data than SFT. To boost RL policy performance, Skywork R1V (Peng et al., 2025) first undergoes a cyclical error correction stage, where the model iteratively learns from previously mispredicted data instances. This curriculum learning paradigm, characterized by persistent refinement through challenging samples, demonstrates statistically significant improvements in the efficacy of the subsequent reinforcement learning. Moreover, Vision-R1 (Huang et al., 2025) adopts a phased RL approach that progressively increases the generation length limit to further improve its models. In contrast, Kimi-VL (Team et al., 2025) implements a length penalty mechanism within its reward design, strategically preventing the model from producing excessively verbose responses. Devised specifically for video reasoning tasks, Video-R1 (Feng et al., 2025b) significantly enhances temporal reasoning capabilities by incorporating sequential temporal information into its reward mechanism design, making temporal dependency a critical dimension of the decision-making process.

Similar to most of the aforementioned MLLM RL methods, our SUMMA also utilizes verifiable rewards, determined by the generated responses with reference to the ground truth answers. As for the differences, our task of interest is ad visual content summarization, and thus the suitable reward design is based on the lexical as well as semantic metrics against human-annotated references, inspired by (Feng et al., 2025a; Chang et al., 2025).

3. Methodology

In this section, we present SUMMA, an MLLM that generates representative, concise, and commercially oriented summaries by integrating advertising videos with their corresponding OCR/ASR transcripts. As a roadmap for what follows, Figure 2 illustrates the overall architecture of SUMMA. Building on Qwen2-VL, the framework jointly encodes videos and the corresponding OCR/ASR transcripts, then decodes them into advertising-style highlights that emphasize product brands. To optimize this ability, we adopt a two-stage pipeline.

Supervised Fine-Tuning (SFT) Stage: We collect advertising videos together with their OCR/ASR transcripts, then perform cold-start multimodal supervised fine-tuning on data generated and cleaned by Doubao-1.5-pro (ByteDance, 2025) (hereafter abbreviated as Doubao).
Reinforcement Learning (RL) Stage: We apply GRPO combined with a mixed-reward mechanism to further align the model with human preferences. The remainder of this section details each component of the pipeline in turn.

3.1. Data Curation

Leveraging our in-house search-ad video data, we curate three complementary resources—AdSum-Doubao (LLM-bootstrapped summaries for supervised fine-tuning, SFT), AdSum-Human (expert-annotated summaries for reinforcement learning, RL), and AdSum-Test (a held-out set reserved for final evaluation). The construction of each dataset is detailed in the following sections.

AdSum-Doubao

Guided by the annotation scheme in Figure 3, we prompt the Doubao LLM to produce $\leq$ 40-word synopses that highlight the ad subject, brand, and other commercially salient cues. Concretely, OCR/ASR transcripts extracted from each advertising video, together with basic metadata, are concatenated into the textual input to Doubao, following a modality-flattening strategy similar to LLaVA (Liu et al., 2023b). For every video, we sample several candidate summaries by varying the temperature; a two-stage, Doubao-based automatic verifier then filters and ranks these outputs to yield the final dataset. (1) Linguistic Judgment. Using the validation criteria specified in Figure 3, we select the highest-quality summary from multiple candidates. The prompt we utilize is shown in Figure 5. (2) Relevance Judgment. Each sample of our relevance data includes a user query, a video, and a human-annotated label (0/1). We feed the query and the relevance label, along with the video’s synthesized summary, into Doubao for verification. This process determines whether each summary adequately reflects the relevance judgment (human feedback) for the corresponding query and ad. The LLM outputs a confidence score, and only the summary with the highest score is preserved.

Consequently, only the summaries that pass both the linguistic-quality screening and the relevance filter are kept, leaving us with a clean and high-quality dataset for training.
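The filter-then-rank logic described above can be sketched as follows. This is a minimal illustration, not the production pipeline: `linguistic_ok` and `relevance_score` are hypothetical callables standing in for the two Doubao-based judges, the filter-then-argmax order is a simplification of the two-stage verifier, and counting words by whitespace is an assumption that would not hold for untokenized Chinese text.

```python
from typing import Callable, List, Optional


def select_summary(
    candidates: List[str],
    linguistic_ok: Callable[[str], bool],      # hypothetical stand-in for the linguistic judge
    relevance_score: Callable[[str], float],   # hypothetical stand-in for the relevance judge
    max_words: int = 40,
) -> Optional[str]:
    """Return the best candidate summary, or None if every candidate is rejected."""
    # Stage 1: keep candidates within the length budget that pass the linguistic check.
    passed = [
        c for c in candidates
        if len(c.split()) <= max_words and linguistic_ok(c)
    ]
    if not passed:
        return None
    # Stage 2: keep only the candidate with the highest relevance confidence.
    return max(passed, key=relevance_score)
```

In practice both callables would wrap LLM prompts; keeping them as plain functions makes the selection logic testable without a model in the loop.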

AdSum-Human

Domain experts first follow the protocol in Figure 3 to condense every ad video into a $\leq$ 40-word synopsis that foregrounds the ad subject, brand, and distinctive selling or pain points. The resulting 100K expert summaries then serve as the reward corpus in the RL stage, providing the high-confidence signals needed for stable policy optimisation.

AdSum-Test

To evaluate the ad video summarization capabilities of our SUMMA, we collect 2.5k samples from our platform for manual annotation using the aforementioned methodology.

3.2. Domain-Specific Multimodal Fine-Tuning

The initial fine-tuning phase imbues the model with domain-specific knowledge and processing capabilities. Formally, we optimize the following objective:

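$$\mathcal{L}_{\text{Domain-FT}} = -\mathbb{E}_{(\mathbf{x_{v}},\mathbf{x_{\text{aux}}},\mathbf{y})\sim\mathcal{D}_{\text{SFT}}}\left[\log\pi(\mathbf{y}\mid\mathbf{x_{v}},\mathbf{x_{\text{aux}}};\bm{\theta})\right],$$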

where $\mathbf{x_{v}},\mathbf{x_{\text{aux}}},\mathbf{y}$ are the video, auxiliary text (OCR/ASR) and summary, $\mathcal{D}_{\text{SFT}}$ is the AdSum-Doubao dataset, and $\pi(\cdot;\bm{\theta})$ the multimodal transformer.
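As a small numeric illustration of this objective, the sketch below estimates the loss from per-token probabilities that a model assigns to reference summaries. It assumes the usual autoregressive factorization of $\pi(\mathbf{y}\mid\cdot)$ into per-token probabilities; the function names are hypothetical, and real training frameworks typically also normalize by token count, which the objective as written does not.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """-log pi(y | x_v, x_aux; theta) for one summary, given the probability
    the model assigns to each reference token."""
    return -sum(math.log(p) for p in token_probs)


def domain_ft_loss(batch: List[List[float]]) -> float:
    """Monte Carlo estimate of the SFT objective: mean sequence NLL over a batch."""
    return sum(sequence_nll(seq) for seq in batch) / len(batch)
```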

Upon completion of this domain-specific fine-tuning stage, we obtain the intermediate model SUMMA-SFT, which serves as the foundation for the subsequent RL stage.

3.3. Mixed-Reward Reinforcement Fine-Tuning

SUMMA-SFT effectively localizes key frames and extracts representative textual content tailored for ad recommendations. However, this competence can be further improved through RL. Specifically, we employ GRPO to train our SUMMA-RL. For each query $\bm{q}=(\mathbf{x_{v}},\mathbf{x_{\text{aux}}})\sim\mathcal{D}_{RL}$ , $N$ rollout responses are generated as a group, i.e., $\bm{o}^{(i)}\sim\pi_{\text{old}}(\bm{o}^{(i)}|\bm{q})$ where $i=1,2,...,N$ , with corresponding rewards $\{r^{(1)},r^{(2)},\dots,r^{(N)}\}$ . The optimization objective is defined as:

[TABLE]
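For orientation, the standard GRPO objective of Shao et al. (2024) can be written in the notation above as follows. Whether SUMMA uses exactly this clipped, sequence-level form is an assumption: the clipping range $\epsilon$, the reference policy $\pi_{\text{ref}}$, and the maximization convention (written here as $\mathcal{J}$) come from the GRPO literature rather than from this paper, while $\hat{A}^{(i)}$, $\rho^{(i)}$, and $D_{\text{KL}}$ are the symbols that the paper's objective involves.

```latex
\mathcal{J}_{\text{GRPO}}(\bm{\theta})
  = \mathbb{E}_{\bm{q},\,\{\bm{o}^{(i)}\}_{i=1}^{N}}
    \left[ \frac{1}{N}\sum_{i=1}^{N}
      \min\!\left( \rho^{(i)}\hat{A}^{(i)},\;
                   \operatorname{clip}\!\left(\rho^{(i)},\,1-\epsilon,\,1+\epsilon\right)\hat{A}^{(i)} \right)
      \;-\; \beta\, D_{\text{KL}}\!\left(\pi_{\bm{\theta}} \,\|\, \pi_{\text{ref}}\right)
    \right],
\qquad
\rho^{(i)} = \frac{\pi_{\bm{\theta}}(\bm{o}^{(i)}\mid\bm{q})}{\pi_{\text{old}}(\bm{o}^{(i)}\mid\bm{q})},
\qquad
\hat{A}^{(i)} = \frac{r^{(i)} - \operatorname{mean}\!\left(\{r^{(j)}\}_{j=1}^{N}\right)}
                     {\operatorname{std}\!\left(\{r^{(j)}\}_{j=1}^{N}\right)}.
```

The group-normalized advantage $\hat{A}^{(i)}$ is what removes the need for a learned value function: each rollout is scored only relative to the other rollouts for the same query.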

As the cornerstone of our reinforcement learning, we leverage a mixed-reward mechanism that differs from DeepSeek-R1-like works. In contrast to the binary (0/1) reward design, which evaluates mathematical reasoning performance solely by final-answer correctness, our visual ad summarization task adopts human-annotated reference summaries as the supervision signals. Specifically, given the ground truth summaries, we assess lexical coverage and semantic congruence to reward or penalize the policy’s generated responses.

Lexical Reward (RewardLex)

In neural machine translation, metrics like BLEU (Papineni et al., 2002) are typically used to evaluate the quality of model outputs against human-annotated reference translations. Following this practice, we take BLEU as our lexical reward score function to encourage the model’s generation to cover key advertising elements, including product subjects, merchant brands, and selling points.

Semantic Reward (RewardSem)

To maintain the semantic integrity of golden (human-annotated) ad video summaries while allowing for paraphrasing, we use an LLM as a judge. Specifically, we provide the LLM (Doubao) with both the human-annotated reference summary and the model-generated summary, prompting it to assess whether the latter sufficiently covers all key points from the reference. We also favor concise summaries to enhance clarity and usability. The prompt we use is similar to Figure 5 but additionally provides a human-written summary for reference.

Length Penalty

Moreover, to achieve low-latency requirements for online services and prevent the generated summaries from becoming excessively lengthy during reinforcement learning training, we introduce a penalty term in the semantic reward to constrain the length of generated summaries, as formalized in the following equation:

$$\text{LP}(\bm{o},\bm{y}) = \min\left(1,\frac{|\bm{y}|}{|\bm{o}|}\right)^{\gamma},$$

where LP means “Length Penalty”, $|\bm{y}|$ is the length of human-written reference summary $\bm{y}$ , $|\bm{o}|$ denotes the length of model rollout response $\bm{o}$ , and $\gamma\geq 0$ is a hyperparameter controlling the severity of the imposed penalty.

Bringing together all the aforementioned parts, our final mixed-reward function is:

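$$R(\bm{o},\bm{y}) = \text{BLEU}(\bm{o},\bm{y}) + \text{LP}(\bm{o},\bm{y})\cdot\text{LLM\_Score}(\bm{o},\bm{y}).$$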

Through this mixed-reward mechanism, we can leverage the verifiably correct references to encourage the policy to generate high-quality outputs (advertising content summaries in our scenario) while simultaneously incentivizing the emergence of semantically similar yet more concise summarization patterns.
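The reward defined in this section, $R = \text{BLEU} + \text{LP}\cdot\text{LLM\_Score}$ with $\text{LP} = \min(1, |\bm{y}|/|\bm{o}|)^{\gamma}$, can be sketched in a few lines. The BLEU value and the LLM judge score are passed in as plain numbers because their computation is external; the default `gamma=1.0`, the assumption that both scores lie in [0, 1], and the function names are illustrative choices, not values reported in the paper.

```python
def length_penalty(ref_len: int, out_len: int, gamma: float = 1.0) -> float:
    """LP(o, y) = min(1, |y| / |o|) ** gamma.

    Equals 1 when the rollout is no longer than the reference and
    shrinks toward 0 as the rollout grows longer.
    """
    if out_len <= 0:
        raise ValueError("rollout length must be positive")
    return min(1.0, ref_len / out_len) ** gamma


def mixed_reward(
    bleu: float,        # lexical reward, computed elsewhere
    llm_score: float,   # semantic reward from the LLM judge, computed elsewhere
    ref_len: int,
    out_len: int,
    gamma: float = 1.0,  # illustrative default; the paper only requires gamma >= 0
) -> float:
    """R(o, y) = BLEU(o, y) + LP(o, y) * LLM_Score(o, y)."""
    return bleu + length_penalty(ref_len, out_len, gamma) * llm_score
```

Note that the penalty scales only the semantic term: an overlong rollout loses semantic credit, while the BLEU term is left to its own brevity handling. With `gamma=0` the penalty is switched off entirely.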

3.4. Online Serving

Given the stringent inference latency requirements in search advertising systems, downstream services typically employ lightweight models with limited parameters. As is shown in Figure 6, we address this via SUMMA, generating textual summaries that are statically cached in Redis, and thus downstream models can directly fetch these precomputed summaries for inference tasks. This approach effectively alleviates the multimodal modeling burden on downstream modules while maintaining low inference latency. By decoupling multimodal processing from task-specific learning, the downstream modules can fully concentrate on their dedicated objectives, thereby achieving enhanced performance. Two representative search advertising downstream tasks are as follows:

(1) The Retrieval stage constitutes the initial phase in search advertising systems, tasked with filtering candidate ad sets from massive advertising inventories. These selected candidates are subsequently forwarded to downstream tasks, including relevance computation and click rate prediction. Our SUMMA framework couples with recall modules by directly utilizing SUMMA-RL inference results, i.e., summaries, as advertisement-side features, which engage in similarity calculations with user-side search queries.

(2) The Relevance Ranking stage receives coarsely ranked advertisement candidates from upstream filtration processes and computes relevance scores between user queries and advertisements. Thereafter, it filters out irrelevant cases. To enhance the precision of relevance computation, we consistently employ the output features generated by SUMMA-RL as critical advertising attributes.
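The serving pattern described above (summaries computed offline, cached, and fetched by lightweight text-only models) can be sketched as follows. An in-memory dictionary stands in for Redis so the example runs without a server; the key prefix, the `[SEP]` separator, the class name, and the fallback behaviour are all hypothetical choices for illustration.

```python
from typing import Dict, Optional


class SummaryCache:
    """In-memory stand-in for the Redis cache of precomputed summaries."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, ad_id: str, summary: str) -> None:
        # Written offline, after SUMMA-RL has summarized the ad video.
        self._store[f"summa:{ad_id}"] = summary

    def get(self, ad_id: str) -> Optional[str]:
        # Read online; no multimodal inference happens on the request path.
        return self._store.get(f"summa:{ad_id}")


def build_ranker_input(query: str, ad_id: str, cache: SummaryCache, fallback: str = "") -> str:
    """Text-only input for a lightweight relevance model: query plus cached summary."""
    summary = cache.get(ad_id)
    return f"{query} [SEP] {summary if summary is not None else fallback}"
```

The design point is that the expensive MLLM runs once per ad, while the per-request cost is a single key lookup plus a small text model.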

4. Experiments

In this section, we first delineate the evaluation metrics and experimental configurations that form the foundation of our investigation. Subsequently, comprehensive analyses of both controlled-environment (offline) experiments and real-world deployment (online) trials are provided, accompanied by their respective empirical outcomes.

4.1. Evaluation Metrics

We employ the following metrics for offline and online evaluations.

Offline Metrics

The offline evaluation primarily aims to provide a reliable and multi-faceted benchmark by measuring summary quality, relevant-ad recall, and fine-grained ranking discrimination, thereby establishing a solid basis for subsequent online experiments.

•

BLEU: N -gram precision metric evaluating how closely the generated summary matches the reference text.

•

ROUGE: N -gram or longest-common-subsequence recall indicating how much of the reference content is covered by the summary.

•

CIDEr: TF–IDF-weighted n-gram similarity capturing consensus between the summary and multiple references.

•

RewardSem: Cosine similarity of sentence embeddings quantifying semantic closeness between generated and reference texts.

•

Hit@k: For each query, the fraction of ground-truth ads appearing in the top-k retrieved results.

•

ROC–AUC: Area under the ROC curve representing the probability that a random positive is ranked above a random negative.

•

Recall@Precision90: Recall attained while keeping precision at or above 90%.
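
The three retrieval and ranking metrics above can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (binary labels, ties in ROC–AUC counted as 0.5, thresholds taken from the observed scores), not the evaluation code used in the paper; all function names and example values are hypothetical.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Fraction of ground-truth ads that appear in the top-k results for one query."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at_precision(labels, scores, min_precision=0.9):
    """Highest recall over score thresholds whose precision is >= min_precision."""
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [y for y, s in zip(labels, scores) if s >= threshold]
        tp = sum(predicted)
        if predicted and tp / len(predicted) >= min_precision:
            best = max(best, tp / total_pos)
    return best

# Made-up example: 1 = relevant query-ad pair, 0 = irrelevant.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = roc_auc(labels, scores)
r_at_p90 = recall_at_precision(labels, scores)
hit = hit_at_k(["a", "b", "c", "d"], ["a", "d"], k=2)
```

In this toy example only the two highest-scoring pairs can be accepted without precision dropping below 90%, so the recall at that operating point is 2/3.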

Online Metrics

We conduct online A/B testing to compare SUMMA with the baseline models. Three metrics are considered, as follows:

•

Irrelevant Ratio: The proportion of cases within a sampled dataset that are labeled as “Bad” by human annotators, calculated as $\frac{\#\text{Bad}}{\#\text{Bad}+\#\text{Good}}$.

•

Conversion Ratio: An advertising conversion occurs when a potential customer views a video ad and subsequently takes an action deemed valuable to the advertiser’s business, such as making an online purchase or calling the business from a mobile phone. We define Conversion Ratio as $\frac{\#\text{Conversion}}{\#\text{Click}}$.

•

Ad Revenue: The revenue earned by the advertising system under each method as it engages users and assists advertisers in achieving conversions.
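
As a small worked example, the two ratio metrics above can be computed as follows. The counts are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
def irrelevant_ratio(num_bad, num_good):
    """#Bad / (#Bad + #Good) over a human-annotated sample."""
    total = num_bad + num_good
    if total == 0:
        raise ValueError("the annotated sample is empty")
    return num_bad / total

def conversion_ratio(num_conversions, num_clicks):
    """#Conversion / #Click."""
    if num_clicks == 0:
        raise ValueError("no clicks recorded")
    return num_conversions / num_clicks

# Invented counts, for illustration only.
bad_share = irrelevant_ratio(num_bad=12, num_good=188)
cvr = conversion_ratio(num_conversions=30, num_clicks=1000)
```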

4.2. Experiment Setup

The number of frames extracted from a short video is set to 16, which is determined by our preliminary experiments to strike an effective balance between performance and efficiency. During both the SFT and RL stages, we conduct training on one machine comprising 8 NVIDIA A100 GPUs (80GB). The training framework utilized is verl (Sheng et al., 2024) based on PyTorch.

For SFT, we take Qwen2-VL as our base model to train SUMMA-SFT, with a learning rate of 5e-7. For the learning rate scheduler, we adopt a cosine decay strategy with a linear warm-up of 2000 steps. We use the AdamW optimizer with $\beta_{1}$ = 0.9, $\beta_{2}$ = 0.999, and a weight decay rate of 0.02.
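
The learning rate schedule described above can be sketched as a plain function. The peak learning rate and warm-up length come from the setup; the total number of steps, the final learning rate (`min_lr`), and the exact warm-up convention (here, reaching the peak on the last warm-up step) are assumptions, since the text does not state them.

```python
import math

PEAK_LR = 5e-7        # from the setup above
WARMUP_STEPS = 2000   # from the setup above

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the final warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumed total of 10,000 steps, for illustration only.
schedule = [lr_at(s, total_steps=10_000) for s in (0, 1999, 6000, 10_000)]
```

With these assumed values, the rate climbs to 5e-7 by step 1999, falls to half of the peak midway through the decay phase, and reaches `min_lr` at the final step.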

For GRPO, SUMMA-SFT is used to initialize the policy. The hyperparameters are configured as follows: 8 rollouts are sampled per prompt, while PPO optimization utilizes a batch size of 64. The actor learning rate is 1e-6, and we employ a KL coefficient $\beta=0.001$ to stabilize the training process.
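
To make the role of the 8 rollouts and the KL coefficient concrete, the following sketch shows the group-relative advantage computation commonly used in GRPO, where each rollout's reward is normalized against the other rollouts for the same prompt. It is a generic illustration, not the verl implementation: the use of the population standard deviation, the `eps` term, the scalar form of the KL penalty, and the example rewards are assumptions.

```python
import statistics

GROUP_SIZE = 8     # rollouts sampled per prompt, from the setup above
KL_COEF = 0.001    # beta, from the setup above

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

def kl_penalized(surrogate, kl_to_reference, beta=KL_COEF):
    """Scalar view of the objective: policy surrogate minus beta times the KL term."""
    return surrogate - beta * kl_to_reference

# Made-up rewards for the 8 rollouts of a single prompt.
rewards = [0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.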

4.3. Offline Experiments

Our offline experiments are designed to address the following research questions (RQ):

RQ1. In the context of the search advertising summarization task, how can we make full use of multimodal ad content to optimize MLLMs for generating semantically and commercially impactful ad summaries?

RQ2. How much performance gain does multimodal information bring compared to unimodal approaches (only OCR/ASR or only frames)?

RQ3. What is the impact of the richness of OCR/ASR information?

RQ4. How do different reward designs affect model performance?

RQ5. Compared to the multimodal embedding methods, what improvements can our approach bring to downstream tasks?

SFT → RL is the most effective training pipeline (RQ1)

To evaluate the quality of the summaries generated by SUMMA-RL, we compare it with several baseline models on AdSum-Test (see Section 3.1) across four key metrics: BLEU, CIDEr, ROUGE, and RewardSem (see Section 4.1). The compared models are:

•

Different Qwen2-VL variants (2B and 7B Instruct versions)

•

Our SUMMA-SFT model, which is only trained via SFT on AdSum-Doubao

•

A separate two-stage SFT model, which has undergone SFT on AdSum-Doubao and then SFT on AdSum-Human

•

A merging single-stage SFT model, directly fine-tuned on the combined dataset of AdSum-Doubao and AdSum-Human

•

Our SUMMA-RL model, which is derived from SFT on AdSum-Doubao followed by RL on AdSum-Human

As shown in Table 2, our method demonstrates consistent improvements across all evaluation metrics. Even our intermediate SUMMA-SFT model outperforms both Qwen2-VL models by significant margins (e.g., 0.36 vs 0.14 in BLEU, and 1.01 vs 0.18 in CIDEr). This performance gap highlights the effectiveness of our constructed datasets, which infuse domain-specific knowledge into the model and thus enhance its capability in search advertising scenarios. Moreover, compared to the other training strategies (two-stage SFT or merged single-stage SFT), “SFT then RL” performs best, providing additional evidence that this training paradigm remains effective when applied to the search advertising domain. Figure 8 presents a case study showing the superiority of the summary generated by SUMMA over that of the base model.

Multimodal input clearly outperforms any single modality (RQ2)

To evaluate the efficacy of multimodal integration, we conduct a controlled ablation study comparing three training configurations: unimodal (video-only), unimodal (OCR/ASR-only), and multimodal (video+OCR/ASR). As evidenced in Table 3, our multimodal approach, which combines the different modalities, demonstrates statistically significant improvements across all evaluation metrics. Specifically, compared to the best unimodal baseline, we observe improvements of +0.14 (BLEU), +0.60 (CIDEr), +0.14 (ROUGE), and +0.17 (RewardSem), underscoring the complementary nature of visual and textual modalities. Therefore, integrating linguistic signals from ASR/OCR with visual features provides significant advantages for comprehensive video understanding and summarization, as it captures richer semantic context than any single modality alone.

More OCR/ASR text brings larger gains (RQ3)

We collect two additional datasets, each comprising 2,500 samples, categorized by the richness of their OCR/ASR textual information. This stratification is designed to evaluate how OCR/ASR information density affects the same SUMMA-RL model. The “Sparse” category denotes samples with combined OCR/ASR character counts below 10, while “Rich” denotes the remaining samples. Table 4 shows that the model performs better as the OCR/ASR information content increases.
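
The stratification rule above can be written as a one-line check. The threshold of 10 characters comes from the text; whether whitespace and punctuation count toward the total is not stated, so this sketch simply counts every character. The function name and example strings are hypothetical.

```python
SPARSE_CHAR_LIMIT = 10  # combined OCR/ASR character count, from the text above

def richness_bucket(ocr_text, asr_text, limit=SPARSE_CHAR_LIMIT):
    """Label a sample 'Sparse' if its combined OCR/ASR text has fewer than `limit` characters."""
    n_chars = len(ocr_text) + len(asr_text)  # counts all characters; an assumption
    return "Sparse" if n_chars < limit else "Rich"

# Made-up samples for illustration.
buckets = [
    richness_bucket("SALE", ""),
    richness_bucket("50% off today", "buy now"),
]
```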

Mixed lexical + semantic rewards beat single-aspect rewards (RQ4)

We conduct ablation studies to investigate the effectiveness of our mixed-reward mechanism. As demonstrated in Table 5, models trained with the mixed reward perform clearly better than those trained solely with either the lexical or the semantic reward. This empirical evidence suggests that the two rewards are complementary, enabling the combined reward to effectively incentivize the model to generate high-quality ad summaries.

Figure 7 illustrates the evolution of training rewards and generated response lengths throughout the training process. As shown, the reward values exhibit a consistent upward trend during the initial phase, eventually converging to a stable plateau with minimal oscillations. Concurrently, the summary length remains relatively stable in the early stages, followed by a modest increase and subsequent gradual decline. This trajectory clearly demonstrates the effectiveness of our designed reward mechanism.

High-quality summaries enhance downstream retrieval and ranking (RQ5)

For diverse downstream tasks (i.e., Retrieval and Relevance Ranking, see Section 3.4), we compare SUMMA against multimodal embedding models. In retrieval, our in-house evaluation dataset consists of 30K query-ad pairs and 5M candidate ads, while the relevance ranking test set consists of 5K query-ad pairs.

As shown in Table 6 and Table 7, SUMMA demonstrates significant superiority over embedding-based methods in both search advertising tasks. These empirical results substantiate that SUMMA achieves more adequate utilization and sophisticated modeling of multimodal advertising contents, thereby validating the efficacy of our research trajectory.

Additionally, we assess our method from the perspectives of diversity and granularity. First, we randomly select 10K queries, perform ANN searches to retrieve their top-100 similar ads from each index, and compute the diversity ratio as the number of unique retrieved ads divided by the number of all ads. Second, we define a granularity ratio to quantify the impact of feature granularity on retrieval performance. Specifically, as the granularity of ad features becomes finer, the diversity of retrieved results for similar queries should correspondingly improve. Therefore, we select 10K queries with minimal semantic differences and repeat the same retrieval process used for the diversity ratio. As shown in Table 6, the diversity ratio increases by over 3% and the granularity ratio increases by over 2%, indicating that SUMMA expands the diversity of retrieved advertisements while also discriminating among ad candidates at a finer granularity.
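
A minimal sketch of the diversity ratio follows. It assumes that “all ads” means the total number of ads in the searched index, which the text does not state explicitly, so the denominator is passed in as a parameter. The granularity ratio would use the same function on the result lists of near-duplicate queries. The function name, the toy result lists, and the index size are hypothetical.

```python
def diversity_ratio(retrieved_per_query, num_ads_in_index):
    """Unique ads retrieved across all queries, divided by the (assumed) index size."""
    if num_ads_in_index <= 0:
        raise ValueError("index size must be positive")
    unique_ads = set()
    for ad_ids in retrieved_per_query:
        unique_ads.update(ad_ids)
    return len(unique_ads) / num_ads_in_index

# Toy top-2 results for three queries against two hypothetical indexes of 10 ads each.
coarse_index_results = [["a", "b"], ["a", "b"], ["a", "c"]]
fine_index_results = [["a", "b"], ["c", "d"], ["a", "e"]]
coarse_ratio = diversity_ratio(coarse_index_results, num_ads_in_index=10)
fine_ratio = diversity_ratio(fine_index_results, num_ads_in_index=10)
```

A higher ratio means the queries surface a wider range of distinct ads rather than repeatedly returning the same few candidates.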

4.4. Online Experiments

We conduct a production-level A/B test by incorporating SUMMA into the BERT-based retrieval and relevance models within our real-time search advertising system. While keeping all other factors unchanged, we expose the experimental variant to 10% of our platform's total search ad traffic for 14 consecutive days to ensure statistical significance.

As shown in Table 8, SUMMA yields a statistically significant 1.5% increase in advertising revenue. Additionally, through rigorous human evaluation, we observe a 5% reduction in the Irrelevant Ratio (i.e., the relevance bad rate). Overall, these quantitative improvements collectively validate SUMMA’s capacity to optimize the whole search advertising system while simultaneously improving the user experience.

Therefore, SUMMA has been deployed as the core component supporting our entire search advertising ecosystem. Moreover, beyond its fundamental role in powering retrieval and relevance ranking, SUMMA can also be used to enhance feature integration for other ranking models, such as those targeting Click-Through Rate (CTR) and Conversion Rate (CVR) prediction.

In conclusion, the practical deployment of our SUMMA on the real-world online search ad system demonstrates its measurable benefits in production environments.

5. Conclusion

In this paper, we present SUMMA, a novel MLLM trained on our newly constructed multimodal advertising summarization datasets via supervised fine-tuning and reinforcement learning. As the first OCR/ASR-coordinated multimodal cognitive approach for search advertising, SUMMA can effectively process visual-text features in advertising videos to generate concise and commercially valuable video summaries. In addition, we implement a pipeline integrated into the search advertising system, in which SUMMA serves as the upstream component and works with various downstream modules to perform candidate retrieval and relevance ranking. Empirical validation through offline experiments and online production testing confirms SUMMA’s superior performance. In future research, we will explore more effective RL strategies for MLLMs, such as incorporating more downstream tasks into the reward design to further refine the quality of multimodal advertising summaries, ultimately leading to more accurate user-ad matching and a better user experience.

GenAI Usage Disclosure

We use Doubao to generate and verify our ad summarization SFT data (AdSum-Doubao), and also employ it as the judge model to determine the semantic rewards in our RL training stage.

Bibliography

2AI (2024) Meta AI. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
3Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a Visual Language Model fo
4Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV] https://arxiv.org/abs/2308.12966
5Bai et al. (2025) Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Report. arXiv:2502.13923 [cs.CV] https://arxiv.org/abs/2502.13923
6ByteDance (2025) ByteDance. 2025. Doubao-1.5-pro. https://seed.bytedance.com/en/special/doubao_1_5_pro
7Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, H
8Chang et al. (2021) Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, and Inderjit S. Dhillon. 2021. Extreme Multi-label Learning for Semantic Matching in Product Search. arXiv:2106.12657 [cs.IR] https://arxiv.org/abs/2106.12657