TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations
Mert Can Cakmak, Nitin Agarwal, Diwash Poudel

TL;DR
TriPSS is a tri-modal keyframe extraction framework that combines perceptual, structural, and semantic features, and it reports state-of-the-art video summarization results on the TVSum20 and SumMe benchmarks.
Contribution
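TriPSS fuses perceptual (CIELAB color), structural (ResNet-50), and semantic (LLaMA-3.2-11B-Vision-Instruct caption) representations with principal component analysis, then applies HDBSCAN clustering and a refinement stage (quality assessment and duplicate filtering) to select keyframes.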
Findings
Achieves state-of-the-art results on TVSum20 and SumMe benchmarks.
Effectively captures complementary visual and semantic cues.
Outperforms unimodal and previous multimodal methods.
Abstract
Efficient keyframe extraction is critical for video summarization and retrieval, yet capturing the full semantic and visual richness of video content remains challenging. We introduce TriPSS, a tri-modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet-50, and semantic context from frame-level captions generated by LLaMA-3.2-11B-Vision-Instruct. These modalities are fused using principal component analysis to form compact multi-modal embeddings, enabling adaptive video segmentation via HDBSCAN clustering. A refinement stage incorporating quality assessment and duplicate filtering ensures the final keyframe set is both concise and semantically diverse. Evaluations on the TVSum20 and SumMe benchmarks show that TriPSS achieves state-of-the-art performance, significantly outperforming both unimodal and prior multimodal approaches.…
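The fusion and selection steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the z-score normalization, the nearest-to-centroid selection rule, and the cosine-similarity threshold are all assumptions made for illustration. The sketch takes per-frame feature matrices as given (feature extraction with CIELAB, ResNet-50, and caption embeddings is out of scope) and expects cluster labels from an external clusterer such as HDBSCAN, with -1 marking noise frames. Quality assessment is omitted.

```python
import numpy as np


def zscore(x):
    """Standardize each feature column; constant columns become all zeros."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant features
    return (x - mu) / sd


def fuse_with_pca(modalities, n_components):
    """Concatenate standardized per-frame features and project them onto
    the top principal components (PCA computed via SVD)."""
    stacked = np.hstack([zscore(m) for m in modalities])
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def pick_keyframes(embeddings, labels):
    """Assumed selection rule: for each cluster (label -1 is treated as
    noise and skipped), pick the frame closest to the cluster mean."""
    keyframes = []
    for lab in sorted(set(labels.tolist()) - {-1}):
        idx = np.flatnonzero(labels == lab)
        center = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - center, axis=1)
        keyframes.append(int(idx[np.argmin(dists)]))
    return sorted(keyframes)


def drop_near_duplicates(embeddings, keyframes, threshold=0.95):
    """Assumed duplicate filter: greedily keep a keyframe only if its cosine
    similarity to every already-kept keyframe is below the threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = embeddings / norms
    kept = []
    for k in keyframes:
        if all(float(unit[k] @ unit[j]) < threshold for j in kept):
            kept.append(k)
    return kept
```

In use, one would call `fuse_with_pca([color_feats, resnet_feats, caption_feats], n_components=k)`, cluster the result, and pass the labels to `pick_keyframes` followed by `drop_near_duplicates`. Standardizing each modality before concatenation is one way to keep a high-dimensional modality from dominating the principal components; the abstract does not say how TriPSS handles this.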
Taxonomy
Methods: Sparse Evolutionary Training
