TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Mert Can Cakmak; Nitin Agarwal; Diwash Poudel

arXiv:2506.05395 · cs.CV · September 3, 2025

TL;DR

TriPSS is a tri-modal keyframe extraction framework that fuses perceptual (CIELAB color), structural (ResNet-50), and semantic (frame-caption) features, and reports state-of-the-art results on the TVSum20 and SumMe video summarization benchmarks.

Contribution

TriPSS introduces a multi-modal fusion approach that combines perceptual, structural, and semantic representations through principal component analysis, then applies adaptive HDBSCAN clustering and a refinement stage (quality assessment and duplicate filtering) to select keyframes.
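The refinement step can be pictured with a small sketch. This is not the authors' implementation: the greedy strategy, the function name `filter_duplicates`, the per-frame quality score, and the 0.95 cosine-similarity threshold are all assumptions made for illustration. It keeps the highest-quality candidate frames and drops any candidate whose embedding is nearly identical to one already kept.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_duplicates(candidates, threshold=0.95):
    """Greedy duplicate filter (hypothetical helper, not from the paper).

    candidates: list of (frame_index, quality_score, embedding) tuples.
    Frames are visited from highest to lowest quality; a frame is kept only
    if its similarity to every already-kept frame is below the threshold.
    Returns the kept frame indices in temporal order.
    """
    kept = []
    for idx, quality, emb in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(cosine(emb, kept_emb) < threshold for _, _, kept_emb in kept):
            kept.append((idx, quality, emb))
    return sorted(idx for idx, _, _ in kept)

# Toy candidates: frames 10 and 12 look almost the same, frame 40 differs.
candidates = [
    (10, 0.9, [1.0, 0.0]),
    (12, 0.7, [0.99, 0.05]),
    (40, 0.8, [0.0, 1.0]),
]
keyframes = filter_duplicates(candidates)
print(keyframes)
```

In this toy input, frame 12 is discarded because it is a near-duplicate of the higher-quality frame 10. How quality is actually scored, and which similarity measure the paper uses, is not stated in this summary.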

Findings

01. Achieves state-of-the-art results on the TVSum20 and SumMe benchmarks.

02. Effectively captures complementary visual and semantic cues.

03. Outperforms unimodal and previous multimodal methods.

Abstract

Efficient keyframe extraction is critical for video summarization and retrieval, yet capturing the full semantic and visual richness of video content remains challenging. We introduce TriPSS, a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions generated by LLaMA-3.2-11B-Vision-Instruct. These modalities are fused using principal component analysis to form compact multi-modal embeddings, enabling adaptive video segmentation via HDBSCAN clustering. A refinement stage incorporating quality assessment and duplicate filtering ensures the final keyframe set is both concise and semantically diverse. Evaluations on the TVSum20 and SumMe benchmarks show that TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches.…
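As a rough illustration of the fusion step described in the abstract, the sketch below concatenates three per-frame feature matrices and reduces them with PCA. It is a minimal sketch under stated assumptions, not the paper's code: the feature matrices are random stand-ins, their dimensions, the z-score normalization, the choice of 8 components, and the helper names `zscore` and `fuse_with_pca` are all hypothetical. PCA is computed through NumPy's SVD.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean and unit variance."""
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (x - x.mean(axis=0)) / std

def fuse_with_pca(modalities, n_components: int) -> np.ndarray:
    """Concatenate per-frame features from all modalities, then apply PCA.

    modalities: list of arrays, each of shape (n_frames, dim_i).
    Returns an array of shape (n_frames, n_components).
    """
    # Normalizing per modality is an assumption, so no single modality dominates.
    stacked = np.hstack([zscore(m) for m in modalities])
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
n_frames = 40
perceptual = rng.normal(size=(n_frames, 6))   # stand-in for CIELAB color features
structural = rng.normal(size=(n_frames, 32))  # stand-in for ResNet-50 embeddings
semantic = rng.normal(size=(n_frames, 16))    # stand-in for caption embeddings

fused = fuse_with_pca([perceptual, structural, semantic], n_components=8)
print(fused.shape)
```

The fused rows would then be passed to a density-based clusterer such as HDBSCAN to segment the video. That step is left out here to keep the example dependent on NumPy alone.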

Taxonomy

Methods: Sparse Evolutionary Training