Object Referring in Videos with Language and Human Gaze
Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

TL;DR
This paper introduces a new video dataset and a novel neural network model for object referring in videos, leveraging appearance, motion, gaze, and spatio-temporal context to improve localization accuracy.
Contribution
It provides a new video dataset for object referring, with 30,000 objects over 5,000 stereo video sequences annotated with descriptions and gaze, and proposes a network model that integrates appearance, motion, gaze, and spatio-temporal context.
Findings
The proposed method outperforms previous object referring methods.
Incorporating gaze and motion cues improves localization accuracy.
The dataset enables research on gaze-based and spatio-temporal object understanding.
Abstract
We investigate the problem of object referring (OR), i.e., to localize a target object in a visual scene that comes with a language description. Humans perceive the world more as continued video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated for their descriptions and gaze. We further propose a novel network model for OR in videos, by integrating appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show…
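To make the idea of combining cues concrete, here is a minimal sketch of how per-cue scores for candidate objects could be fused to pick the referred object. It is not the paper's network: the abstract does not specify the architecture, so the weighted-sum (late fusion) scheme, the function names `fuse_scores` and `refer`, the score values, and the weights are all assumptions made for illustration.

```python
import math

# Cue names follow the abstract; everything else here is hypothetical.
CUES = ("appearance", "motion", "gaze", "context")

def fuse_scores(candidates, weights):
    """Softmax over candidates of a weighted sum of per-cue scores (assumed late fusion)."""
    logits = [sum(weights[c] * cand[c] for c in CUES) for cand in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refer(candidates, weights):
    """Return the index of the most probable candidate and all probabilities."""
    probs = fuse_scores(candidates, weights)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return best, probs

# Made-up scores: how well each candidate box matches the description under each cue.
candidates = [
    {"appearance": 0.9, "motion": 0.1, "gaze": 0.2, "context": 0.3},
    {"appearance": 0.7, "motion": 0.8, "gaze": 0.9, "context": 0.6},
]

all_cues = {c: 1.0 for c in CUES}
appearance_only = {"appearance": 1.0, "motion": 0.0, "gaze": 0.0, "context": 0.0}

best_all, probs_all = refer(candidates, all_cues)
best_app, probs_app = refer(candidates, appearance_only)
```

In this toy example, appearance alone favors the first candidate, while adding motion, gaze, and context shifts the decision to the second. That is the kind of disambiguation the extra cues are meant to provide, although a learned network would replace the hand-set weights.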
