EventRR: Event Referential Reasoning for Referring Video Object Segmentation
Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu

TL;DR
EventRR is a framework for referring video object segmentation that reasons over the semantic event structure and temporal relations in referring expressions, outperforming state-of-the-art methods on four benchmarks.
Contribution
The paper proposes EventRR, which decouples RVOS into an object summarization part and a referent reasoning part, and uses a Referential Event Graph and Temporal Concept-Role Reasoning to improve referent understanding.
Findings
Outperforms state-of-the-art RVOS methods on four benchmarks.
Effectively models event attributes and temporal relations in video referring expressions.
Demonstrates strong interpretability through concept-role reasoning steps.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment the object in a video that is referred to by an expression. Current RVOS methods treat referring expressions as unstructured sequences, neglecting the semantic structure that is essential for referent reasoning. Moreover, unlike image-referring expressions, whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to…
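The two-level summarization described in the abstract can be illustrated with a rough sketch. The abstract does not specify how bottleneck tokens are computed or aggregated, so everything below is an assumption made for illustration: softmax attention pooling stands in for both steps, and the names `attention_pool`, `summarize_video`, `frame_queries` and `video_queries` are hypothetical, not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, features):
    """Summarize n feature vectors into len(queries) tokens.

    Assumption: scaled dot-product attention pooling is used here as a
    stand-in; the paper's actual summarization operator may differ.
    """
    d = len(queries[0])
    tokens = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in features]
        weights = softmax(scores)
        tokens.append([sum(w * f[j] for w, f in zip(weights, features))
                       for j in range(d)])
    return tokens


def summarize_video(frames, frame_queries, video_queries):
    """Frame-level summarization followed by video-level aggregation."""
    # Step 1: compress each frame's features into a few bottleneck tokens.
    frame_tokens = [attention_pool(frame_queries, f) for f in frames]
    # Step 2 (assumed): pool the bottleneck tokens of all frames together.
    flat = [tok for per_frame in frame_tokens for tok in per_frame]
    video_tokens = attention_pool(video_queries, flat)
    return frame_tokens, video_tokens


def random_matrix(rng, rows, cols):
    """Random Gaussian matrix standing in for learned or extracted features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
frames = [random_matrix(rng, 50, 16) for _ in range(4)]  # 4 frames, 50 features each
frame_queries = random_matrix(rng, 8, 16)                # 8 bottleneck tokens per frame
video_queries = random_matrix(rng, 8, 16)                # 8 video-level tokens
frame_tokens, video_tokens = summarize_video(frames, frame_queries, video_queries)
print(len(frame_tokens), len(frame_tokens[0]), len(video_tokens))
```

The point of the bottleneck in this sketch is cost: the video-level step attends over 4 × 8 = 32 tokens rather than all 4 × 50 = 200 frame features, which is one plausible reading of "efficiently aggregated".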
