A Unified Sequence Interface for Vision Tasks

Ting Chen; Saurabh Saxena; Lala Li; Tsung-Yi Lin; David J. Fleet; Geoffrey Hinton

arXiv:2206.07669 · cs.CV · October 18, 2022 · 49 cites

TL;DR

This paper introduces a unified sequence-based framework for multiple vision tasks, enabling a single model architecture to perform object detection, instance segmentation, keypoint detection, and image captioning without task-specific modifications.

Contribution

The authors propose a shared pixel-to-sequence interface that unifies diverse vision tasks, allowing a single model to handle multiple tasks with prompts and sequence outputs.

Findings

01 Achieves competitive performance across tasks

02 Uses a single architecture and loss function

03 No task-specific customization needed

Abstract

While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve…
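To make the idea of "a sequence of discrete tokens" concrete, here is a minimal sketch of how one bounding box might be turned into tokens and back. It is an illustration under stated assumptions, not the authors' implementation: the bin count, the vocabulary offsets, the coordinate order, and the helper names (`quantize`, `box_to_tokens`, `tokens_to_box`) are all hypothetical.

```python
from typing import List, Tuple

# All constants below are assumptions chosen for illustration only.
NUM_BINS = 1000                        # number of discrete coordinate bins
COORD_OFFSET = 10                      # ids below this are reserved (e.g. for prompt tokens)
CLASS_OFFSET = COORD_OFFSET + NUM_BINS  # class ids follow the coordinate ids

Box = Tuple[float, float, float, float]  # (ymin, xmin, ymax, xmax), normalized to [0, 1]


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to a bin index in [0, num_bins - 1]."""
    value = min(max(value, 0.0), 1.0)  # clamp out-of-range inputs
    return min(int(value * num_bins), num_bins - 1)


def dequantize(bin_index: int, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the center of its bin."""
    return (bin_index + 0.5) / num_bins


def box_to_tokens(box: Box, class_id: int) -> List[int]:
    """Encode one box as five tokens: four coordinate tokens, then a class token."""
    return [COORD_OFFSET + quantize(c) for c in box] + [CLASS_OFFSET + class_id]


def tokens_to_box(tokens: List[int]) -> Tuple[Box, int]:
    """Decode five tokens back into an approximate box and its class id."""
    ymin, xmin, ymax, xmax = (dequantize(t - COORD_OFFSET) for t in tokens[:4])
    return (ymin, xmin, ymax, xmax), tokens[4] - CLASS_OFFSET


if __name__ == "__main__":
    tokens = box_to_tokens((0.25, 0.125, 0.75, 0.5), class_id=3)
    print(tokens)
    print(tokens_to_box(tokens))
```

Because every output is an integer sequence over one shared vocabulary, the same next-token training objective can in principle cover boxes, keypoints, and caption words; the decoded box differs from the input only by the quantization error of at most half a bin.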

Code & Models

Repositories

google-research/pix2seq (TensorFlow, official)

Videos

A Unified Sequence Interface for Vision Tasks · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications