Generalized Decoding for Pixel, Image, and Language

Xueyan Zou; Zi-Yi Dou; Jianwei Yang; Zhe Gan; Linjie Li; Chunyuan Li; Xiyang Dai; Harkirat Behl; Jianfeng Wang; Lu Yuan; Nanyun Peng; Lijuan Wang; Yong Jae Lee; Jianfeng Gao

arXiv:2212.11270 · cs.CV · December 22, 2022 · 1 cites

TL;DR

X-Decoder is a unified model that seamlessly integrates pixel-level segmentation and language understanding, enabling versatile vision-language tasks with strong transferability and state-of-the-art performance across multiple datasets.

Contribution

It introduces a novel generalized decoding framework supporting all image segmentation types and vision-language tasks in a single model without pseudo-labeling.

Findings

01

Achieves state-of-the-art results on open-vocabulary segmentation and referring segmentation.

02

Demonstrates strong transferability to various downstream tasks in zero-shot and finetuning settings.

03

Offers flexible and efficient finetuning for diverse vision-language applications.

Abstract

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide…
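To make the two-query idea in the abstract concrete, here is a toy sketch in plain Python. It is not the X-Decoder implementation: the function names (`attend`, `toy_decode`), the single attention step, and the residual update are all assumptions chosen for illustration. It only shows how non-semantic (latent) queries and text-derived queries could pass through one shared decoder and yield both per-pixel mask logits and per-query semantic embeddings in the same feature space.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Single-head dot-product attention: a weighted average of `keys`."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(query))]


def toy_decode(pixel_feats, latent_queries, text_queries):
    """Hypothetical one-layer decoder (illustrative only).

    pixel_feats:    list of per-pixel feature vectors (flattened image)
    latent_queries: generic, non-semantic query vectors
    text_queries:   query vectors assumed to come from a text encoder
    Returns (mask_logits, semantic), where mask_logits has one row per
    latent query and one column per pixel, and semantic has one embedding
    per query (latent and text alike).
    """
    queries = latent_queries + text_queries
    # Every query, whatever its origin, attends to the same image features.
    updated = [
        [q_i + a_i for q_i, a_i in zip(q, attend(q, pixel_feats))]
        for q in queries
    ]
    # Pixel-level output: similarity of each latent query to each pixel.
    n_latent = len(latent_queries)
    mask_logits = [[dot(q, p) for p in pixel_feats] for q in updated[:n_latent]]
    # Token-level output: the updated query embeddings themselves.
    semantic = updated
    return mask_logits, semantic


# Four "pixels" with 2-d features, two latent queries, one text query.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
masks, sems = toy_decode(pixels, [[2.0, 0.0], [0.0, 2.0]], [[1.0, 1.0]])
print(len(masks), len(masks[0]), len(sems))
```

Because masks and semantic embeddings are read from the same updated queries, a single forward pass can serve segmentation-style and language-style outputs; how the actual model wires, trains, and supervises these outputs is described in the paper, not in this sketch.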

Code & Models

Repositories

microsoft/X-Decoder
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling