Towards General Purpose Vision Systems
Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem

TL;DR
This paper introduces GPV-1, a versatile vision-language model capable of handling diverse tasks without architectural modifications, aiming to simplify the development of general-purpose vision systems.
Contribution
The paper presents GPV-1, a task-agnostic vision-language architecture, along with evaluations of architectural generality, skill-concept transfer, and learning efficiency, as a step towards general-purpose vision systems.
Findings
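GPV-1 is effective at multiple tasks with a single architecture, including classification, localization, visual question answering, and captioning.
GPV-1 can perform the referring expressions task zero-shot.
Training on a few referring-expression samples further improves on GPV-1's zero-shot performance.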
Abstract
Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process. In this paper, we propose GPV-1, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more. We also propose evaluations of generality of architecture, skill-concept…
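The task-agnostic interface described in the abstract (an image goes in, text and/or bounding boxes come out, with no per-task output heads) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: `TaskOutput`, `ToyGeneralPurposeModel`, `read_text`, `read_boxes`, the natural-language task description argument, the per-box relevance scores, and the normalized box format are assumptions, not the authors' code or API. The toy model returns canned predictions and only shows how a single call signature could serve several tasks.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x_min, y_min, x_max, y_max); normalized coordinates are an assumption here.
Box = Tuple[float, float, float, float]


@dataclass
class TaskOutput:
    """One output structure shared by every task (hypothetical)."""
    text: str = ""
    boxes: List[Box] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)  # one relevance score per box


class ToyGeneralPurposeModel:
    """Stand-in that returns canned predictions. It is not GPV-1."""

    def predict(self, image: object, task_description: str) -> TaskOutput:
        # A real model would condition on both inputs; this toy ignores them.
        return TaskOutput(
            text="dog",
            boxes=[(0.1, 0.2, 0.6, 0.9), (0.5, 0.5, 0.7, 0.8)],
            scores=[0.9, 0.2],
        )


def read_text(output: TaskOutput) -> str:
    """Text tasks (classification, VQA, captioning) read the text field."""
    return output.text


def read_boxes(output: TaskOutput, threshold: float = 0.5) -> List[Box]:
    """Localization keeps the boxes whose relevance score meets the threshold."""
    return [b for b, s in zip(output.boxes, output.scores) if s >= threshold]


model = ToyGeneralPurposeModel()
image = None  # placeholder; the toy model needs no real image

# The same predict() call serves both tasks; only the description changes.
vqa = model.predict(image, "What animal is in the image?")
loc = model.predict(image, "Locate the dog.")
print(read_text(vqa))
print(read_boxes(loc))
```

Under these assumptions, supporting a new task would mean supplying a new task description and training data rather than adding an output head or a loss, which is the property the abstract argues for.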
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
