All in Tokens: Unifying Output Space of Visual Tasks via Soft Token

Jia Ning; Chen Li; Zheng Zhang; Zigang Geng; Qi Dai; Kun He; Han Hu

arXiv:2301.02229 · cs.CV · February 15, 2023 · 1 citation

TL;DR

This paper introduces a single unified model for visual tasks that uses soft tokens to handle instance segmentation and depth estimation simultaneously, and it sets a new record of 0.279 RMSE on NYUv2 depth estimation.

Contribution

The paper proposes a novel soft token technique and mask augmentation to unify output spaces of diverse visual tasks within a single model.

Findings

01 Achieved a state-of-the-art 0.279 RMSE on NYUv2 depth estimation.

02 A single unified model performs well on both instance segmentation and depth estimation.

03 Soft tokens improve the accuracy of both next-token inference and decoding of the task output.

Abstract

Unlike language tasks, where the output space is usually limited to a set of tokens, the output space of visual tasks is more complicated, making it difficult to build a unified visual model for various visual tasks. In this paper, we seek to unify the output space of visual tasks, so that we can also build a unified model for visual tasks. To this end, we demonstrate a single unified model that simultaneously handles two typical visual tasks of instance segmentation and depth estimation, which have discrete/fixed-length and continuous/varied-length outputs, respectively. We propose several new techniques that take into account the particularity of visual tasks: 1) Soft token. We employ soft token to represent the task output. Unlike hard tokens in the common VQ-VAE which are assigned one-hot to discrete codebooks/vocabularies, the soft token is assigned softly to the codebook…
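The contrast the abstract draws between hard and soft tokens can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a soft token is a softmax-weighted average of codebook embeddings, whereas a hard token picks the single highest-scoring entry. The function names, codebook size, and embedding width are all hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def hard_token_embedding(logits, codebook):
    # Hard assignment (VQ-VAE style): each position selects exactly one
    # codebook entry, equivalent to a one-hot weighting.
    indices = logits.argmax(axis=-1)
    return codebook[indices]

def soft_token_embedding(logits, codebook):
    # Soft assignment (assumed form): each position is a probability-weighted
    # average over all codebook entries.
    weights = softmax(logits, axis=-1)
    return weights @ codebook

# Hypothetical sizes: 8 codebook entries of width 4, 3 token positions.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
logits = rng.normal(size=(3, 8))

hard = hard_token_embedding(logits, codebook)
soft = soft_token_embedding(logits, codebook)
```

Under this assumption, the soft embedding reduces to the hard one when the distribution over the codebook is sharply peaked, and otherwise blends neighbouring entries, which is one way a continuous output such as depth could be represented more smoothly than with one-hot codes.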

Code & Models

Repositories

swintransformer/ait
PyTorch · Official

Videos

All in Tokens: Unifying Output Space of Visual Tasks via Soft Token · YouTube

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Advanced Vision and Imaging

Methods: VQ-VAE