UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

Zhengyuan Yang; Zhe Gan; Jianfeng Wang; Xiaowei Hu; Faisal Ahmed; Zicheng Liu; Yumao Lu; Lijuan Wang

arXiv:2111.12085·cs.CV·July 28, 2022

TL;DR

UniTAB unifies text and box outputs in a single model for grounded vision-language tasks, improving interpretability and performance across multiple benchmarks by representing outputs with a shared token sequence.

Contribution

UniTAB introduces a unified token sequence with a special <obj> token, so a single model can generate text and boxes together with their word-box alignments. This simplifies the architecture and improves performance.

Findings

01 Outperforms the state of the art on grounded captioning and captioning tasks.

02 Achieves comparable or better results than task-specific models across 7 VL benchmarks.

03 Is parameter-efficient and generalizes to new tasks.

Abstract

We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate desired text and box outputs together, and meanwhile indicate the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text and box outputs with a shared token sequence, and introduces a special <obj> token to naturally indicate word-box alignments in the sequence. UniTAB thus could provide a more comprehensive and interpretable image description, by freely grounding generated words to object regions. On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms…
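
The shared token sequence described in the abstract can be illustrated with a small sketch. It interleaves caption words with discretized box coordinates and uses <obj> to mark which words a box belongs to. Only the <obj> token itself comes from the abstract; the placement of <obj> around each grounded span, the `<bin_k>` token spelling, the bin count of 200, and all function names are assumptions made for illustration and are not taken from the paper or the microsoft/UniTAB repository.

```python
import re
from typing import List, Optional, Sequence, Tuple

OBJ = "<obj>"                        # special alignment token named in the abstract
NUM_BINS = 200                       # hypothetical number of coordinate bins
BIN_RE = re.compile(r"<bin_(\d+)>")  # hypothetical spelling of a coordinate token

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def quantize_box(box: Box, width: float, height: float,
                 num_bins: int = NUM_BINS) -> List[str]:
    """Map pixel coordinates to discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    tokens = []
    for value, size in ((x1, width), (y1, height), (x2, width), (y2, height)):
        index = int(value * num_bins / size)
        index = max(0, min(num_bins - 1, index))  # clamp so value == size stays in range
        tokens.append(f"<bin_{index}>")
    return tokens


def build_sequence(segments: Sequence[Tuple[str, Optional[Box]]],
                   width: float, height: float) -> List[str]:
    """Flatten (phrase, box-or-None) segments into one token sequence.

    Assumed layout for a grounded phrase: <obj> words... box tokens... <obj>
    """
    sequence: List[str] = []
    for phrase, box in segments:
        words = phrase.split()
        if box is None:
            sequence.extend(words)
        else:
            sequence.append(OBJ)
            sequence.extend(words)
            sequence.extend(quantize_box(box, width, height))
            sequence.append(OBJ)
    return sequence


def parse_sequence(tokens: Sequence[str]):
    """Recover the plain caption and the (phrase, bin indices) alignments."""
    caption: List[str] = []
    alignments: List[Tuple[str, Tuple[int, ...]]] = []
    phrase: List[str] = []
    bins: List[int] = []
    inside = False
    for token in tokens:
        match = BIN_RE.fullmatch(token)
        if token == OBJ:
            if inside:  # closing marker: emit the finished alignment
                alignments.append((" ".join(phrase), tuple(bins)))
                phrase, bins = [], []
            inside = not inside
        elif match:
            if inside:
                bins.append(int(match.group(1)))
        else:
            caption.append(token)
            if inside:
                phrase.append(token)
    return " ".join(caption), alignments


if __name__ == "__main__":
    example = [("a man", (10, 20, 110, 220)), ("holding", None),
               ("a frisbee", (100, 50, 150, 100))]
    tokens = build_sequence(example, width=200, height=400)
    print(tokens)
    print(parse_sequence(tokens))
```

Because words, coordinates, and alignment markers all live in one sequence, a single decoder head can in principle emit all three, which is the simplification the abstract contrasts with multi-module designs.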

Code & Models

Repositories

microsoft/UniTAB (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: ALIGN