All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
Jia Ning, Chen Li, Zheng Zhang, Zigang Geng, Qi Dai, Kun He, Han Hu

TL;DR
This paper introduces a unified model for visual tasks using soft tokens, enabling simultaneous handling of instance segmentation and depth estimation with improved accuracy and a new record on NYUv2 depth estimation.
Contribution
The paper proposes a novel soft token technique and mask augmentation to unify output spaces of diverse visual tasks within a single model.
Findings
Achieved state-of-the-art 0.279 RMSE on NYUv2 depth estimation.
Unified model performs well on both instance segmentation and depth estimation.
Soft tokens improve the accuracy of both next-token inference and decoding of the task output.
Abstract
Unlike language tasks, where the output space is usually limited to a set of tokens, the output space of visual tasks is more complicated, making it difficult to build a unified visual model for various visual tasks. In this paper, we seek to unify the output space of visual tasks, so that we can also build a unified model for visual tasks. To this end, we demonstrate a single unified model that simultaneously handles two typical visual tasks of instance segmentation and depth estimation, which have discrete/fixed-length and continuous/varied-length outputs, respectively. We propose several new techniques that take into account the particularity of visual tasks: 1) Soft token. We employ soft token to represent the task output. Unlike hard tokens in the common VQ-VAE which are assigned one-hot to discrete codebooks/vocabularies, the soft token is assigned softly to the codebook…
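To make the hard-versus-soft distinction in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy three-entry codebook and hypothetical per-token logits; the names `hard_token`, `soft_token`, `codebook`, and `logits` are illustrative only. A hard token keeps a single codebook entry, while a soft token is the probability-weighted average of all entries.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_token(logits, codebook):
    """One-hot assignment: return the single most likely codebook entry."""
    idx = max(range(len(logits)), key=lambda i: logits[i])
    return list(codebook[idx])


def soft_token(logits, codebook):
    """Soft assignment: probability-weighted average of all codebook entries."""
    probs = softmax(logits)
    dim = len(codebook[0])
    return [
        sum(p * entry[d] for p, entry in zip(probs, codebook))
        for d in range(dim)
    ]


# Toy codebook with three 2-D embeddings (assumed values, for illustration).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical logits where entries 1 and 2 are almost equally likely.
logits = [0.1, 2.0, 1.9]

hard = hard_token(logits, codebook)  # snaps to entry 1 and discards entry 2
soft = soft_token(logits, codebook)  # blends entries according to their probabilities

print("hard:", hard)
print("soft:", [round(v, 3) for v in soft])
```

When two codebook entries are nearly tied, the hard assignment throws away the runner-up entirely, whereas the soft assignment keeps that uncertainty in the embedding passed to the decoder. This is one plausible reading of why a soft assignment can help with continuous outputs such as depth.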
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · Advanced Vision and Imaging
Methods: VQ-VAE
