Unified Open-Vocabulary Dense Visual Prediction

Hengcan Shi; Munawar Hayat; Jianfei Cai

arXiv:2307.08238·cs.CV·August 21, 2023·1 cites

Open Access

TL;DR

This paper introduces a unified network for open-vocabulary dense visual prediction tasks, leveraging multi-modal data and a specialized training mechanism to improve performance across multiple dense prediction tasks.

Contribution

The paper proposes a unified network architecture with multi-modal, multi-scale, and multi-task decoding, together with a training mechanism that bridges domain and task gaps so that multiple dense prediction tasks can be trained jointly.

Findings

1. Effective on four datasets, outperforming task-specific models.

2. Leverages diverse training data to enhance individual task performance.

3. Addresses multi-modal data integration and domain gaps in unified models.

Abstract

In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific and tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse training data to boost individual tasks. We address two major challenges in unified OV prediction. Firstly, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. Therefore,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques