Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

Haijing Liu; Zhiyuan Song; Hefeng Wu; Tao Pu; Keze Wang; Liang Lin

arXiv:2512.24323·cs.CV·January 1, 2026

Open Access

TL;DR

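This paper introduces CERES, a plug-in causal framework that adapts pre-trained RVOS backbones to egocentric referring video object segmentation (Ego-RVOS). It targets two sources of spurious correlation: skewed object-action pairings in the training data and visual confounders of the egocentric view, such as rapid motion and frequent occlusions. The paper reports improved robustness and state-of-the-art results.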

Contribution

The paper proposes a causal intervention approach for Ego-RVOS that integrates dual-modal causal adjustments to improve robustness against dataset biases and visual confounders.

Findings

01. CERES achieves state-of-the-art performance on Ego-RVOS benchmarks.

02. Causal interventions improve robustness to egocentric distortions.

03. Dual-modal approach effectively mitigates dataset biases and visual confounders.

Abstract

Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications