TL;DR
TransVLM is a vision-language framework that detects shot transitions in videos as continuous temporal segments rather than isolated cut points, injects optical flow as a motion prior at the input stage, and is reported to outperform existing methods.
Contribution
The paper formalizes the Shot Transition Detection task and proposes TransVLM, a VLM that incorporates optical flow and a data engine for robust training and benchmarking.
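To make the difference between cut points and transition segments concrete, here is a minimal sketch of how detected transition segments could be turned into clean shots. The function name `shots_from_transitions` and the representation of a transition as an inclusive `(start, end)` frame range are illustrative assumptions, not the paper's interface.

```python
from typing import List, Tuple

# Assumption: a segment is an inclusive (start_frame, end_frame) pair.
Segment = Tuple[int, int]


def shots_from_transitions(num_frames: int, transitions: List[Segment]) -> List[Segment]:
    """Return the frame ranges that remain after removing transition segments.

    Hypothetical helper: every frame inside a transition segment is excluded,
    so no returned shot contains partially blended transition frames.
    """
    shots: List[Segment] = []
    cursor = 0  # first frame not yet assigned to a shot or a transition
    for start, end in sorted(transitions):
        if start > cursor:
            # Frames between the previous transition and this one form a shot.
            shots.append((cursor, start - 1))
        # Skip past this transition (max handles overlapping segments).
        cursor = max(cursor, end + 1)
    if cursor < num_frames:
        # Trailing frames after the last transition form the final shot.
        shots.append((cursor, num_frames - 1))
    return shots


# A 100-frame clip with one gradual transition spanning frames 40..49.
print(shots_from_transitions(100, [(40, 49)]))
```

With a single cut point at, say, frame 45, one of the two resulting shots would have to absorb transition frames; with the segment representation, frames 40 to 49 are dropped and both shots stay clean.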
Findings
TransVLM achieves superior performance over traditional methods.
Explicit motion modeling improves temporal awareness.
Synthetic data generation enhances training robustness.
Abstract
Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional…
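The input-stage fusion described in the abstract can be pictured with a short sketch. This is one possible reading, assuming the color and motion representations are concatenated along the channel axis and that the optical flow has already been computed by some external estimator; the paper's actual fusion may differ. The function name `fuse_color_and_flow` and the array layouts are hypothetical.

```python
import numpy as np


def fuse_color_and_flow(frames: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Concatenate RGB frames and optical flow along the channel axis.

    Assumed layouts (illustrative, not taken from the paper):
      frames: (T, H, W, 3) color frames
      flow:   (T-1, H, W, 2) per-pixel (dx, dy) flow between consecutive frames
    Returns an array of shape (T, H, W, 5).
    """
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("frames must have shape (T, H, W, 3)")
    if flow.ndim != 4 or flow.shape[-1] != 2:
        raise ValueError("flow must have shape (T-1, H, W, 2)")
    if flow.shape[0] != frames.shape[0] - 1 or flow.shape[1:3] != frames.shape[1:3]:
        raise ValueError("flow must cover consecutive frame pairs at the same resolution")

    # Assumption: the last frame has no successor, so it gets zero flow.
    pad = np.zeros((1,) + flow.shape[1:], dtype=np.float32)
    flow_padded = np.concatenate([flow.astype(np.float32), pad], axis=0)

    # Channel-wise concatenation: 3 color channels followed by 2 motion channels.
    return np.concatenate([frames.astype(np.float32), flow_padded], axis=-1)


# Usage with placeholder data: 4 frames of 8x8 pixels.
frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)
flow = np.ones((3, 8, 8, 2), dtype=np.float32)
fused = fuse_color_and_flow(frames, flow)
print(fused.shape)
```

Fusing at the input keeps the downstream model unchanged apart from accepting the extra channels, which is in the spirit of the abstract's "simple yet effective" description.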
