Towards In-context Scene Understanding
Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, Olivier J. Hénaff

TL;DR
This paper introduces Hummingbird, a model that uses in-context learning with nearest neighbor retrieval for scene understanding tasks, achieving near-specialist performance without task-specific finetuning.
Contribution
The paper presents a novel pretraining protocol and a retrieval-based in-context learning approach for scene understanding, enabling flexible task execution without finetuning.
Findings
Hummingbird performs various scene understanding tasks without modification.
It approaches the performance of finetuned specialist models.
It enables efficient new task learning in interactive settings.
Abstract
In-context learning, the ability to configure a model's behavior with different prompts, has revolutionized the field of natural language processing, alleviating the need for task-specific models and paving the way for generalist models capable of assisting with any query. Computer vision, in contrast, has largely stayed in the former regime: specialized decoders and finetuning protocols are generally required to perform dense tasks such as semantic segmentation and depth estimation. In this work we explore a simple mechanism for in-context learning of such scene understanding tasks: nearest neighbor retrieval from a prompt of annotated features. We propose a new pretraining protocol, leveraging attention within and across images, which yields representations particularly useful in this regime. The resulting Hummingbird model,…
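The retrieval mechanism named in the abstract (nearest neighbor retrieval from a prompt of annotated features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `l2_normalize` and `nn_retrieve_labels`, the choice of cosine similarity, the softmax weighting, and the values of `k` and `temperature` are all assumptions made for illustration. The toy features stand in for patch embeddings that a pretrained image encoder would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def nn_retrieve_labels(query_feats, memory_feats, memory_labels, k=2, temperature=0.1):
    """Predict a label for each query feature from its k nearest annotated features.

    query_feats:   (Q, D) features of the image to be labeled (hypothetical encoder output)
    memory_feats:  (M, D) features from the annotated prompt images
    memory_labels: (M, C) annotations aligned with memory_feats
                   (one-hot classes for segmentation, or C=1 values for depth)
    Returns (Q, C) predictions as a similarity-weighted average of retrieved labels.
    """
    q = l2_normalize(query_feats)
    m = l2_normalize(memory_feats)
    sims = q @ m.T                                   # (Q, M) cosine similarities
    k = min(k, m.shape[0])
    # Indices of the k most similar memory entries per query (unordered within the k).
    top_idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    top_sims = np.take_along_axis(sims, top_idx, axis=1)
    # Softmax over the retrieved neighbors; temperature is an assumed hyperparameter.
    logits = top_sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted average of the neighbors' labels: (Q, k) x (Q, k, C) -> (Q, C).
    return np.einsum("qk,qkc->qc", weights, memory_labels[top_idx])


# Toy prompt: four annotated features, two per class (classes 0 and 1).
memory_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
memory_labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Two query features, one close to each class.
query_feats = np.array([[1.0, 0.05], [0.05, 1.0]])

pred = nn_retrieve_labels(query_feats, memory_feats, memory_labels, k=2)
pred_classes = pred.argmax(axis=1)
```

Because the labels are only looked up and averaged, swapping the prompt (for example, replacing one-hot class labels with depth values of shape `(M, 1)`) changes the task without changing the model or the retrieval code, which is the sense in which no task-specific finetuning is needed.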
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
