Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
Md Moinul Islam, Sofoklis Kakouros, Janne Heikkilä, Mourad Oussalah

TL;DR
This paper introduces a behaviour-aware multimodal video summarization framework that integrates text, audio, and visual cues to produce more accurate and emotionally relevant summaries, outperforming traditional methods.
Contribution
It presents a novel multimodal approach that combines prosodic, textual, and visual features, including the concept of bonus words, to enhance video summarization quality.
Findings
Significant improvements in ROUGE-1 and BERTScore over traditional extractive summarization.
Video-based evaluation F1-score improved by nearly 23%.
Effective integration of multimodal cues for semantic and emotional relevance.
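For readers unfamiliar with ROUGE-1, one of the text metrics named in the findings, the following is a minimal sketch of how it is commonly computed: clipped unigram overlap between a candidate summary and a reference, reported as precision, recall, and F1. The function name `rouge1` and the whitespace tokenization are assumptions made for illustration; this is not the paper's evaluation code, and standard ROUGE toolkits add stemming and other normalization.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from clipped unigram overlap.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example strings, not data from the paper.
p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(p, r, round(f, 3))
```

In this toy example every candidate word appears in the reference, so precision is 1.0, while only half of the reference words are covered, so recall is 0.5.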
Abstract
The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues, and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods,…
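The abstract describes bonus words as terms emphasized across multiple modalities. A minimal sketch of that idea, assuming each modality has already produced a set of emphasized words, might look like the following. The function name `find_bonus_words`, the `min_modalities` threshold, and the sample data are all hypothetical; the paper's actual detection and weighting procedure is not given in this summary.

```python
from collections import defaultdict


def find_bonus_words(emphasized_by_modality: dict[str, set[str]],
                     min_modalities: int = 2) -> set[str]:
    """Return words flagged as emphasized in at least `min_modalities` modalities.

    Assumption: each modality (e.g. text, audio, visual) supplies its own
    set of emphasized words; how those sets are obtained is out of scope.
    """
    modalities_per_word: dict[str, set[str]] = defaultdict(set)
    for modality, words in emphasized_by_modality.items():
        for word in words:
            modalities_per_word[word.lower()].add(modality)
    return {
        word
        for word, modalities in modalities_per_word.items()
        if len(modalities) >= min_modalities
    }


# Hypothetical per-modality emphasis sets, not data from the paper.
emphasized = {
    "text": {"deadline", "budget", "important"},
    "audio": {"deadline", "important", "today"},
    "visual": {"deadline", "team"},
}
print(sorted(find_bonus_words(emphasized)))
```

Raising `min_modalities` to 3 in this toy data would keep only words flagged by all three modalities.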
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Topic Modeling
