Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen, Bohan Liu, Chenjia Li, Lalithkumar Seenivasan, Mathias Unberath

TL;DR
This paper introduces an online video reasoning segmentation framework built on just-in-time digital twins. The framework supports efficient multi-step reasoning without fine-tuning large language models, and the paper also presents a new benchmark for evaluation.
Contribution
The paper proposes a new agent framework that separates perception and reasoning, utilizing just-in-time digital twins to improve online video reasoning segmentation without LLM fine-tuning.
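The separation of perception from reasoning can be illustrated with a toy sketch. This is not the authors' implementation: the names `TwinObject`, `DigitalTwin`, `perceive`, and `segment_left_of` are hypothetical, the detections are hard-coded stand-ins for the output of a vision model, and the reading of "just-in-time" as "store only what the current query needs" is an assumption made for illustration. The point is only that the reasoning step operates on a structured scene record rather than on pixels, so no LLM fine-tuning is involved in this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TwinObject:
    """One object in the scene record (hypothetical structure)."""
    label: str
    box: tuple  # (x0, y0, x1, y1) in pixels


@dataclass
class DigitalTwin:
    """Per-frame structured scene record, filled in as frames arrive."""
    frames: dict = field(default_factory=dict)  # frame index -> list[TwinObject]

    def update(self, frame_idx, objects):
        self.frames[frame_idx] = list(objects)


def perceive(raw_detections, needed_labels):
    """Perception step (stub): keep only objects the query needs.

    In a real system raw_detections would come from vision models;
    here they are supplied by the caller.
    """
    return [d for d in raw_detections if d.label in needed_labels]


def segment_left_of(twin, frame_idx, target_label, anchor_label):
    """Reasoning step: targets lying entirely to the left of some anchor."""
    objs = twin.frames.get(frame_idx, [])
    anchors = [o for o in objs if o.label == anchor_label]
    return [
        o for o in objs
        if o.label == target_label
        and any(o.box[2] <= a.box[0] for a in anchors)
    ]


# Toy query: "the cup to the left of the laptop", evaluated on frame 0.
raw = [
    TwinObject("cup", (0, 0, 10, 10)),
    TwinObject("cup", (50, 0, 60, 10)),
    TwinObject("laptop", (20, 0, 40, 10)),
    TwinObject("chair", (70, 0, 90, 30)),  # irrelevant to the query
]
twin = DigitalTwin()
twin.update(0, perceive(raw, needed_labels={"cup", "laptop"}))
result = segment_left_of(twin, 0, "cup", "laptop")
```

In this sketch the twin is updated one frame at a time, which mirrors the online setting: each incoming frame adds an entry and earlier frames are never reprocessed.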
Findings
Effective handling of complex multi-step queries.
Scalable online video reasoning without fine-tuning.
Benchmark with 200 videos and 895 queries across multiple reasoning types.
Abstract
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these…
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Visual Attention and Saliency Detection
