Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

Nimrod Berman; Adam Botach; Emanuel Ben-Baruch; Shunit Haviv Hakimi; Asaf Gendler; Ilan Naiman; Erez Yosef; Igor Kviatkovsky

arXiv:2512.21778·cs.CV·March 24, 2026

TL;DR

Scene-VLM introduces a multimodal vision-language model for video scene segmentation, leveraging sequential reasoning, contextual cues, and explainability to outperform existing methods on standard benchmarks.

Contribution

Scene-VLM is the first fine-tuned vision-language model (VLM) framework for video scene segmentation that jointly processes multimodal cues, models sequential dependencies among shots, and supports explainability.

Findings

01

Achieves state-of-the-art performance on MovieNet with +6 AP and +13.7 F1 improvements.

02

Introduces a confidence scoring scheme for controllable precision-recall trade-offs.

03

Enables generation of natural-language rationales for boundary decisions.
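The precision-recall trade-off mentioned in finding 02 can be illustrated with a minimal sketch. It assumes each candidate shot boundary already has a confidence score in [0, 1]; how Scene-VLM derives those scores is not reproduced here. The function name `precision_recall_at`, the scores, and the labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def precision_recall_at(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when a boundary is predicted iff score >= threshold.

    Hypothetical helper: the scores stand in for per-shot boundary
    confidences, and the labels mark the true scene boundaries.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Convention used here: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Made-up confidences and ground truth for five candidate boundaries.
scores = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [True, False, True, True, False]

for threshold in (0.9, 0.5, 0.3):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a higher threshold keeps only the most confident boundary (high precision, low recall), and lowering the threshold recovers more true boundaries at the cost of admitting a false positive.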

Abstract

Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues, including frames, transcriptions, and optional metadata, to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially, with causal dependencies among shots, and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level…
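One way to picture a context-focus window is sketched below. The abstract does not specify the mechanism, so this is an assumption-laden illustration rather than the paper's method: the shot sequence is cut into non-overlapping focus groups (the shots whose boundary decisions are produced), and each group is padded with neighbouring shots that serve only as context. The function name `context_focus_windows` and the parameters `focus_size` and `context_size` are hypothetical.

```python
from typing import List, Tuple


def context_focus_windows(
    num_shots: int, focus_size: int, context_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (window_shots, focus_shots) index lists covering all shots.

    Hypothetical sketch: each shot lands in exactly one focus group, and
    every window extends the focus group by up to `context_size` shots on
    each side, clipped at the ends of the video.
    """
    if focus_size <= 0 or context_size < 0:
        raise ValueError("focus_size must be > 0 and context_size >= 0")
    windows = []
    for start in range(0, num_shots, focus_size):
        end = min(start + focus_size, num_shots)
        lo = max(0, start - context_size)
        hi = min(num_shots, end + context_size)
        windows.append((list(range(lo, hi)), list(range(start, end))))
    return windows


# Example: 10 shots, decisions made 4 shots at a time, 2 shots of context.
for window, focus in context_focus_windows(10, focus_size=4, context_size=2):
    print("window:", window, "focus:", focus)
```

The design choice illustrated is that shots near the edge of a focus group still see neighbours from the adjacent group, so no decision is made with a one-sided view unless the shot sits at the very start or end of the video.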

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection