TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Ce Chen; Yi Ren; Yuanming Li; Viktor Goriachko; Zhenhui Ye; Zujin Guo; Zhibin Hong; Mingming Gong

arXiv:2604.27975 · cs.CV · May 1, 2026

TL;DR

TransVLM is a vision-language framework that detects shot transitions in videos as continuous temporal segments rather than isolated cut points. It injects optical flow as a motion prior at the input stage and is reported to outperform existing methods.

Contribution

The paper formalizes the Shot Transition Detection (STD) task and proposes TransVLM, a VLM that takes optical flow as an additional input, together with a data engine for robust training and benchmarking.

Findings

1. TransVLM achieves superior performance over traditional methods.
2. Explicit motion modeling improves temporal awareness.
3. Synthetic data generation enhances training robustness.

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional…
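
To make the difference between cut points and transition segments concrete, the following is a minimal sketch, not the authors' implementation. It assumes each transition is given as an inclusive (start_frame, end_frame) pair; the function name `shots_from_transitions` and this representation are hypothetical and chosen only for illustration. Once the full extent of every transition is known, the clean shots are the frames left over, so no shot contains partially blended frames.

```python
from typing import List, Tuple

# Hypothetical representation: an inclusive (start_frame, end_frame) pair.
Segment = Tuple[int, int]


def shots_from_transitions(num_frames: int, transitions: List[Segment]) -> List[Segment]:
    """Return the frame ranges that remain after removing transition segments."""
    shots: List[Segment] = []
    cursor = 0  # first frame not yet assigned to a shot or a transition
    for start, end in sorted(transitions):
        if start > cursor:
            # Frames between the previous transition and this one form a shot.
            shots.append((cursor, start - 1))
        # Skip past this transition; max() handles overlapping segments.
        cursor = max(cursor, end + 1)
    if cursor < num_frames:
        shots.append((cursor, num_frames - 1))
    return shots


if __name__ == "__main__":
    # A 100-frame clip with one gradual transition covering frames 40..49.
    print(shots_from_transitions(100, [(40, 49)]))
```

A point-based detector would have to choose a single frame inside 40..49 as the boundary, leaving part of the dissolve in at least one of the two neighbouring shots. The segment-based output excludes all ten frames.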

Code & Models

Repositories

https://chence17.github.io/TransVLM