Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
Nimrod Berman, Adam Botach, Emanuel Ben-Baruch, Shunit Haviv Hakimi, Asaf Gendler, Ilan Naiman, Erez Yosef, Igor Kviatkovsky

TL;DR
Scene-VLM is a fine-tuned multimodal vision-language model for video scene segmentation. It combines sequential reasoning over shots with visual and textual context, outperforms existing methods on standard benchmarks, and can explain its boundary decisions.
Contribution
Scene-VLM is the first fine-tuned VLM framework for video scene segmentation. It jointly processes multimodal cues, models sequential dependencies between shots, and supports explainability.
Findings
Achieves state-of-the-art performance on MovieNet with +6 AP and +13.7 F1 improvements.
Introduces a confidence scoring scheme for controllable precision-recall trade-offs.
Enables generation of natural-language rationales for boundary decisions.
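The precision-recall trade-off mentioned above can be illustrated with a minimal sketch. It assumes each candidate shot boundary already has a confidence score in [0, 1]; how Scene-VLM derives those scores is not shown here. The function name `precision_recall_at`, the scores, and the labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def precision_recall_at(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when a boundary is predicted iff score >= threshold.

    `scores` are per-shot boundary confidences (assumed given), and `labels`
    are ground-truth flags (1 = scene boundary, 0 = no boundary).
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Conventions for empty denominators: no predictions -> precision 1.0,
    # no positives in the ground truth -> recall 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example data, not results from the paper.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.3, 0.7):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 removes the one false positive but also drops one true boundary, so precision rises while recall falls. A deployment could pick the threshold that matches its tolerance for missed or spurious boundaries.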
Abstract
Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues, including frames, transcriptions, and optional metadata, to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level…
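The abstract does not specify how the context-focus window is constructed, so the following is only one plausible reading, sketched under stated assumptions: the shot sequence is split into non-overlapping "focus" groups, and each group is shown to the model together with a few neighbouring shots on either side as context. The names `Window`, `context_focus_windows`, `focus_size`, and `pad` are hypothetical and do not come from the paper.

```python
from typing import List, NamedTuple


class Window(NamedTuple):
    context: List[int]  # every shot index shown to the model in this window
    focus: List[int]    # the subset for which boundary decisions are made


def context_focus_windows(num_shots: int, focus_size: int, pad: int) -> List[Window]:
    """Split shots into focus groups, each padded with neighbouring context.

    Assumed design: focus groups tile the video without overlap, and up to
    `pad` shots before and after each group are added as context, clipped
    at the start and end of the video.
    """
    windows = []
    for start in range(0, num_shots, focus_size):
        end = min(start + focus_size, num_shots)
        lo = max(0, start - pad)
        hi = min(num_shots, end + pad)
        windows.append(Window(list(range(lo, hi)), list(range(start, end))))
    return windows


# Example: 10 shots, decisions for 4 shots at a time, 2 shots of context per side.
for w in context_focus_windows(num_shots=10, focus_size=4, pad=2):
    print("context:", w.context, "focus:", w.focus)
```

Under this reading, every shot receives exactly one decision, and shots near the edge of a focus group still see neighbours on both sides, except at the very start or end of the video.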
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
