Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari

TL;DR
LILA is a novel framework that learns pixel-level feature descriptors from videos using linear in-context learning, effectively embedding spatio-temporal scene properties for various vision tasks.
Contribution
LILA introduces a pixel-level representation learning method that leverages spatio-temporal cues and linear in-context learning, and that scales to uncurated videos.
Findings
Improves performance on video object segmentation.
Enhances surface normal estimation accuracy.
Benefits semantic segmentation tasks.
Abstract
One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present LILA, a framework that learns pixel-accurate feature descriptors from videos. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent…
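The abstract names linear in-context learning as the core training element but does not spell out its mechanics. The sketch below shows one plausible reading, offered as an assumption rather than the authors' implementation: fit a linear map from per-pixel features of a context frame to its cue values (for example one depth channel and two motion channels), then apply that map to the features of a query frame. The function names, the ridge regulariser, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_linear_probe(context_feats, context_cues, ridge=1e-3):
    """Closed-form ridge regression from features (N, D) to cues (N, C).

    Hypothetical helper: the source does not say how the linear map is fitted.
    """
    feat_dim = context_feats.shape[1]
    gram = context_feats.T @ context_feats + ridge * np.eye(feat_dim)
    return np.linalg.solve(gram, context_feats.T @ context_cues)  # shape (D, C)


def predict_cues(query_feats, weights):
    """Apply the fitted linear map to query-frame features."""
    return query_feats @ weights


# Synthetic stand-ins for per-pixel features and cue maps (assumed shapes).
rng = np.random.default_rng(0)
n_pixels, feat_dim, cue_dim = 256, 16, 3  # e.g. 1 depth + 2 motion channels
true_map = rng.normal(size=(feat_dim, cue_dim))

context_feats = rng.normal(size=(n_pixels, feat_dim))
noise = 0.01 * rng.normal(size=(n_pixels, cue_dim))  # cues are noisy estimates
context_cues = context_feats @ true_map + noise

query_feats = rng.normal(size=(n_pixels, feat_dim))
query_cues = query_feats @ true_map

# Fit on the context frame, evaluate on the query frame.
weights = fit_linear_probe(context_feats, context_cues)
pred = predict_cues(query_feats, weights)
error = float(np.mean((pred - query_cues) ** 2))
```

Under this reading, the query-frame prediction error would serve as the training signal for the feature encoder: features that make the cues linearly predictable across frames yield a low error. That training loop is likewise an assumption and is omitted here.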
