# Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search

**Authors:** Zeyu Xiong, Yixuan Nan, Li Gao, Hengzhu Tang, Shuaiqiang Wang, Junfeng Wang, Dawei Yin

arXiv: 2508.20559 · 2025-08-29

## TL;DR

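This paper introduces a generative framework for real-time query-driven text summarization in large-scale web search. A lightweight 0.1B-parameter model, specialized through distillation and preference optimization, outperforms the production baseline on industry-relevant metrics while meeting real-time serving constraints.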

## Contribution

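The paper pioneers the application of generative models to query-driven text summarization (QDTS) in industrial web search. It combines large-model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to turn a 0.1B-parameter model into a domain-specialized summarizer that can be deployed at industrial scale.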
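To make the preference-optimization step concrete, here is a minimal sketch of the standard direct preference optimization (DPO) loss for a single preference pair. This summary does not give the authors' exact objective or hyperparameters, so the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions rather than the paper's implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) summary pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. `beta` is an assumed value; the
    summary does not state the one used in the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # summary over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy and reference agree exactly, so the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy already favors the chosen summary more than the reference does,
# so the loss falls below log(2).
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss decreases as the policy raises the likelihood of the preferred summary relative to the reference model, which is what lets a small model be pushed toward preferred outputs without training a separate reward model.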

## Key findings

- Outperforms the production baseline on multiple industry-relevant metrics.
- Achieves state-of-the-art results in QDTS quality.
- Handles approximately 50,000 queries per second at under 55 ms average latency per query, using 334 NVIDIA L20 GPUs.

## Abstract

In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.
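The deployment figures in the abstract can be turned into rough per-GPU numbers with a little arithmetic. The sketch below assumes steady-state traffic spread evenly across all GPUs and applies Little's law (in-flight requests = arrival rate × average latency). Neither assumption is stated in the abstract, and the helper name `serving_estimates` is hypothetical.

```python
def serving_estimates(gpus, qps, avg_latency_s):
    """Back-of-the-envelope serving numbers from aggregate figures.

    Assumes steady state and an even load split across GPUs; these are
    simplifying assumptions, not details reported in the abstract.
    """
    per_gpu_qps = qps / gpus               # throughput each GPU must sustain
    in_flight = qps * avg_latency_s        # Little's law: L = lambda * W
    in_flight_per_gpu = in_flight / gpus   # average concurrent queries per GPU
    return per_gpu_qps, in_flight, in_flight_per_gpu


# Figures quoted in the abstract: 334 GPUs, ~50,000 QPS, 55 ms average latency.
per_gpu_qps, in_flight, in_flight_per_gpu = serving_estimates(334, 50_000, 0.055)

print(f"{per_gpu_qps:.1f} queries/s per GPU")
print(f"{in_flight:.0f} queries in flight overall")
print(f"{in_flight_per_gpu:.1f} queries in flight per GPU")
```

Under these assumptions each GPU sustains roughly 150 queries per second with about eight queries in flight at any moment. The 55 ms figure is an upper bound on average latency, so these concurrency values are upper-bound estimates as well.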

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20559/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20559/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/2508.20559/full.md

---
Source: https://tomesphere.com/paper/2508.20559