TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization
Sumin Kim, Hyemin Jeong, Mingu Kang, Yejin Kim, Yoori Oh, Joonseok Lee

TL;DR
TripleSumm is an adaptive multimodal video summarization method that dynamically fuses visual, text, and audio data at the frame level, significantly improving summarization quality.
Contribution
It introduces TripleSumm, a novel adaptive fusion architecture, and MoSu, the first large-scale multimodal video summarization benchmark.
Findings
Achieves state-of-the-art performance on four benchmarks.
Outperforms existing methods by a significant margin.
Provides a comprehensive multimodal video summarization dataset.
Abstract
The exponential growth of video content necessitates effective video summarization to efficiently extract key information from long videos. However, current approaches struggle to fully comprehend complex videos, primarily because they employ static or modality-agnostic fusion strategies. These methods fail to account for the dynamic, frame-dependent variations in modality saliency inherent in video data. To overcome these limitations, we propose TripleSumm, a novel architecture that adaptively weights and fuses the contributions of visual, text, and audio modalities at the frame level. Furthermore, a significant bottleneck for research into multimodal video summarization has been the lack of comprehensive benchmarks. Addressing this bottleneck, we introduce MoSu (Most Replayed Multimodal Video Summarization), the first large-scale benchmark that provides all three modalities. Extensive…
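The idea of weighting modalities per frame, rather than with one fixed set of weights, can be illustrated with a minimal sketch. This is not TripleSumm's actual architecture: the function `fuse_frames` and the scoring vector `score_vec` are hypothetical, and a real model would learn its scoring parameters and likely use attention instead of a single dot product. The sketch only shows what "adaptive, frame-level" fusion means in contrast to static fusion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_frames(visual, text, audio, score_vec):
    """Fuse three per-frame feature sequences with per-frame weights.

    visual, text, audio: lists of T feature vectors, each of dimension D.
    score_vec: hypothetical scoring vector of dimension D (assumed here;
               in a trained model this would be a learned parameter).
    Returns (fused, weights): T fused vectors and T weight triples
    ordered as (visual, text, audio).
    """
    fused, weights = [], []
    for v, t, a in zip(visual, text, audio):
        # One saliency logit per modality, computed separately for each frame.
        logits = [dot(v, score_vec), dot(t, score_vec), dot(a, score_vec)]
        w = softmax(logits)
        # Weighted sum of the three modality features for this frame.
        fused.append([w[0] * vi + w[1] * ti + w[2] * ai
                      for vi, ti, ai in zip(v, t, a)])
        weights.append(w)
    return fused, weights


# Toy input: frame 0 is visually salient, frame 1 is audio salient.
visual = [[2.0, 0.0], [0.0, 0.0]]
text = [[0.0, 0.0], [0.0, 0.0]]
audio = [[0.0, 0.0], [3.0, 0.0]]
fused, weights = fuse_frames(visual, text, audio, score_vec=[1.0, 0.0])
```

In this toy input the weights shift from the visual modality in the first frame to the audio modality in the second, which a static fusion scheme with one global weight triple cannot express.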
Peer Reviews
Decision: ICLR 2026 Poster
Clear, modular architecture that explicitly separates temporal refinement (MST) from cross-modal fusion (CMF). Strong, consistent gains across multiple datasets/metrics with a small model size. Careful ablation validating design choices (windowing schedule, dynamic fusion, modality combos).
Reliance on “Most Replayed” as ground truth, while pragmatic, can encode popularity and behavior biases; human alignment on MoSu isn’t quantified beyond transfers to SumMe/TVSum. A human study measuring agreement or correlation with editorial summaries would strengthen the benchmark. Selection uses standard KTS + knapsack; could the model be trained end-to-end with a differentiable or learning-to-select objective, and would that change results? While parameter-efficient, the inference cost with four MST blocks and C
1. Well-motivated architecture design: TripleSumm performs adaptive, frame-level fusion of visual, audio, and text signals using specialized Modality and Temporal Blocks, enabling both fine-grained and long-range semantic capture.
2. Benchmark contribution: MoSu fills a critical gap by offering the first large-scale trimodal dataset for video summarization, which meaningfully advances evaluation and reproducibility in this field.
3. Strong empirical results: TripleSumm achieves state-of-the-a
1. Dataset Details and Quality Control: The paper should include more comprehensive details about the proposed dataset, such as the average and variance of summary lengths, textual and audio statistics, and the distribution of video durations. Moreover, the authors need to justify why the generated summaries can be considered high-quality representations of the original videos under the current construction pipeline. It is strongly recommended that the summary quality be validated through human
- S1: The separation between temporal modeling (MST) and cross-modal fusion (CMF) is elegant and easy to interpret.
- S2: The framework effectively captures both intra-modal temporal dependencies and inter-modal relationships.
- S3: The new dataset provides valuable resources for future research on multimodal summarization.
- W1: This paper shows relatively weak originality. The proposed model mainly focuses on optimizing multimodal (visual, text, audio) feature representations through standard attention operations, where self-attention is used to enhance intra-modal features and cross-attention is used for inter-modal fusion. The motivation and methodology closely resemble those of earlier works such as UMT [1] and CFSum [2], without demonstrating a clear conceptual or technical advancement beyond them.
- W2: Thi
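One review notes that summary selection uses standard KTS + knapsack. The knapsack half of that pipeline can be sketched as a 0/1 knapsack over shots: given per-shot importance scores and shot lengths (the shot boundaries would come from KTS, which is not implemented here), pick the subset with the highest total score that fits a length budget. The function name `select_shots` and the toy numbers are illustrative assumptions, not taken from the paper.

```python
def select_shots(scores, lengths, budget):
    """Pick shots maximizing total score with total length <= budget.

    scores:  per-shot importance scores (floats).
    lengths: per-shot lengths in frames (positive ints).
    budget:  maximum summary length in frames (int).
    Returns the sorted indices of the selected shots.
    """
    n = len(scores)
    # best[i][c] = best total score using the first i shots with capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s, l = scores[i - 1], lengths[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if l <= c and best[i - 1][c - l] + s > best[i][c]:
                best[i][c] = best[i - 1][c - l] + s  # option 2: take it

    # Backtrack to recover which shots were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    chosen.reverse()
    return chosen


# Toy example: four shots, summary budget of 60 frames (hypothetical values).
scores = [0.9, 0.2, 0.6, 0.4]
lengths = [40, 30, 30, 20]
picked = select_shots(scores, lengths, budget=60)
```

Because this selection step is a discrete optimization applied after scoring, it is not differentiable, which is what motivates the reviewer's question about end-to-end or learning-to-select alternatives.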
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
