Towards In-context Scene Understanding

Ivana Balažević; David Steiner; Nikhil Parthasarathy; Relja Arandjelović; Olivier J. Hénaff

arXiv:2306.01667 · cs.CV · November 1, 2023 · 6 cites

TL;DR

This paper introduces Hummingbird, a model that uses in-context learning with nearest neighbor retrieval for scene understanding tasks, achieving near-specialist performance without task-specific finetuning.

Contribution

The paper presents a novel pretraining protocol and a retrieval-based in-context learning approach for scene understanding, enabling flexible task execution without finetuning.

Findings

01

Hummingbird performs various scene understanding tasks without modification.

02

It approaches the performance of finetuned specialist models.

03

It enables efficient new task learning in interactive settings.

Abstract

In-context learning – the ability to configure a model's behavior with different prompts – has revolutionized the field of natural language processing, alleviating the need for task-specific models and paving the way for generalist models capable of assisting with any query. Computer vision, in contrast, has largely stayed in the former regime: specialized decoders and finetuning protocols are generally required to perform dense tasks such as semantic segmentation and depth estimation. In this work we explore a simple mechanism for in-context learning of such scene understanding tasks: nearest neighbor retrieval from a prompt of annotated features. We propose a new pretraining protocol – leveraging attention within and across images – which yields representations particularly useful in this regime. The resulting Hummingbird model,…
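The retrieval mechanism named in the abstract (nearest neighbor lookup in a prompt of annotated features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_label_propagation`, the parameters `k` and `temperature`, the use of cosine similarity, and the softmax-weighted vote are all assumptions made for illustration, and the toy 2-D vectors stand in for patch features that would come from a frozen image encoder.

```python
import numpy as np


def knn_label_propagation(prompt_feats, prompt_labels, query_feats,
                          num_classes, k=3, temperature=0.1):
    """Hypothetical helper: label each query feature by a softmax-weighted
    vote over its k most similar annotated prompt features."""
    # L2-normalise so that a dot product equals cosine similarity.
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = q @ p.T                                   # (num_queries, num_prompt)

    # Indices and similarities of the k nearest prompt features per query.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # (num_queries, k)
    nn_sims = np.take_along_axis(sims, nn_idx, axis=1)

    # Softmax over the k neighbours (temperature is an assumed knob).
    logits = (nn_sims - nn_sims.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the neighbours' one-hot labels.
    one_hot = np.eye(num_classes)[prompt_labels[nn_idx]]  # (num_queries, k, C)
    return (weights[..., None] * one_hot).sum(axis=1)     # (num_queries, C)


# Toy prompt: four annotated features, two per class.
prompt_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
prompt_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[1.0, 0.05], [0.05, 1.0]])

probs = knn_label_propagation(prompt_feats, prompt_labels, query_feats,
                              num_classes=2)
pred = probs.argmax(axis=1)
```

Under these assumptions, switching tasks only means swapping the annotated prompt (for example, segmentation classes for binned depth values); the encoder and the retrieval step stay unchanged, which is the sense in which no task-specific finetuning is needed.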

Code & Models

Repositories

vpariza/open-hummingbird-eval
pytorch

Videos

Towards In-context Scene Understanding · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques