UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang

TL;DR
UniTAB unifies text and box outputs in a single model for grounded vision-language tasks, improving interpretability and performance across multiple benchmarks by representing outputs with a shared token sequence.
Contribution
UniTAB introduces a unified token sequence with a special <obj> token, so a single model can generate text and boxes together and mark which words align with which boxes. This simplifies the architecture and improves performance.
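To make the shared-sequence idea concrete, here is a minimal sketch of how a caption and its boxes might be flattened into one token list and read back. Only the <obj> token comes from the summary above; the closing </obj> token, the 200-bin coordinate quantization, the (x_min, y_min, x_max, y_max) ordering, and the names GroundedSpan, quantize, serialize, and parse are illustrative assumptions, not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple, Union

NUM_BINS = 200  # assumed number of coordinate bins (hypothetical value)
COORD = re.compile(r"^<(\d+)>$")  # matches a coordinate token such as "<90>"


@dataclass
class GroundedSpan:
    """Words tied to one box, given as (x_min, y_min, x_max, y_max) in pixels."""
    words: List[str]
    box: Tuple[float, float, float, float]


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to an integer bin in [0, num_bins - 1]."""
    index = int(value * num_bins / size)
    return min(max(index, 0), num_bins - 1)


def serialize(segments: List[Union[str, GroundedSpan]],
              width: float, height: float) -> List[str]:
    """Flatten plain text and grounded spans into one shared token sequence."""
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, str):
            tokens.extend(seg.split())
            continue
        x0, y0, x1, y1 = seg.box
        bins = [quantize(x0, width), quantize(y0, height),
                quantize(x1, width), quantize(y1, height)]
        tokens.append("<obj>")                  # opens a grounded span
        tokens.extend(seg.words)
        tokens.extend(f"<{b}>" for b in bins)   # box as four discrete tokens
        tokens.append("</obj>")                 # assumed closing marker
    return tokens


def parse(tokens: List[str]) -> Tuple[str, List[Tuple[str, Tuple[int, ...]]]]:
    """Recover the caption and the word-box alignments from the sequence."""
    words: List[str] = []
    groundings: List[Tuple[str, Tuple[int, ...]]] = []
    span_words = None
    span_bins: List[int] = []
    for tok in tokens:
        if tok == "<obj>":
            span_words, span_bins = [], []
        elif tok == "</obj>":
            if span_words is not None:
                groundings.append((" ".join(span_words), tuple(span_bins)))
            span_words = None
        elif span_words is not None and COORD.match(tok):
            span_bins.append(int(COORD.match(tok).group(1)))
        else:
            words.append(tok)
            if span_words is not None:
                span_words.append(tok)
    return " ".join(words), groundings


example = serialize(
    ["A", GroundedSpan(["donut"], (180, 83, 368, 180)), "on the plate"],
    width=400, height=200,
)
print(example)
print(parse(example))
```

Because the words and the box sit inside the same span, the alignment is read directly off the sequence, so no separate matching module is needed in this sketch.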
Findings
Outperforms prior state-of-the-art models on grounded captioning and captioning tasks.
Achieves comparable or better results than task-specific models across 7 VL benchmarks.
Parameter-efficient and generalizable to new tasks.
Abstract
We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate desired text and box outputs together, and meanwhile indicate the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text and box outputs with a shared token sequence, and introduces a special <obj> token to naturally indicate word-box alignments in the sequence. UniTAB thus could provide a more comprehensive and interpretable image description, by freely grounding generated words to object regions. On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: ALIGN
