Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction
Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath

TL;DR
This paper introduces a temporally-constrained video reasoning segmentation task that requires models to understand how object relevance changes over the course of a video, and provides an automatically constructed benchmark dataset for it.
Contribution
It proposes a novel task formulation for temporally-aware video segmentation and introduces an automated method for constructing relevant benchmark datasets.
Findings
Introduced the temporally-constrained video reasoning segmentation task.
Developed an automated benchmark construction method.
Created the TCVideoRSBenchmark dataset with 52 samples.
Abstract
Conventional approaches to video segmentation are confined to predefined object categories and cannot identify out-of-vocabulary objects, let alone objects that are not identified explicitly but only referred to implicitly in complex text queries. This shortcoming limits the utility of video segmentation in complex and variable scenarios, where a closed set of object categories is difficult to define and where users may not know the exact object category that will appear in the video. Such scenarios can arise in operating room video analysis, where different health systems may use different workflows and instrumentation, requiring flexible solutions for video analysis. Reasoning segmentation (RS) now offers promise towards such a solution, using natural language text queries as the interaction for identifying objects to segment. However, existing video RS formulations assume that target…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition
