Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction

Yiqing Shen; Chenjia Li; Chenxiao Fan; Mathias Unberath

arXiv:2507.16718 · cs.CV · July 23, 2025

Open Access

TL;DR

This paper introduces temporally-constrained video reasoning segmentation, a new task that addresses the limitations of existing methods by requiring models to understand how object relevance changes over time, and it provides an automatically constructed benchmark dataset.

Contribution

It proposes a novel task formulation for temporally-aware video segmentation and introduces an automated method for constructing benchmark datasets for that task.

Findings

01

Introduced the temporally-constrained video reasoning segmentation task.

02

Developed an automated benchmark construction method.

03

Created the TCVideoRSBenchmark dataset with 52 samples.

Abstract

Conventional approaches to video segmentation are confined to predefined object categories and cannot identify out-of-vocabulary objects, let alone objects that are not identified explicitly but only referred to implicitly in complex text queries. This shortcoming limits the utility of video segmentation in complex and variable scenarios, where a closed set of object categories is difficult to define and where users may not know the exact object category that will appear in the video. Such scenarios can arise in operating room video analysis, where different health systems may use different workflows and instrumentation, requiring flexible solutions for video analysis. Reasoning segmentation (RS) now offers promise towards such a solution, enabling natural language text queries as the means of interaction for identifying the object to segment. However, existing video RS formulations assume that target…

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition