Aligning and Prompting Everything All at Once for Universal Visual Perception

Yunhang Shen; Chaoyou Fu; Peixian Chen; Mengdan Zhang; Ke Li; Xing Sun; Yunsheng Wu; Shaohui Lin; Rongrong Ji

arXiv:2312.02153·cs.CV·December 5, 2023·2 cites

TL;DR

APE introduces a universal visual perception model that aligns and prompts multiple vision tasks simultaneously, achieving state-of-the-art results across diverse datasets without task-specific fine-tuning.

Contribution

The paper proposes a novel instance-level sentence-object matching paradigm that unifies detection, segmentation, and grounding in a single model, advancing universal visual perception.
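
The idea of matching whole sentences to objects can be illustrated with a toy example. The sketch below is not APE's architecture or code. It assumes each prompt (a category name or a region description) has already been pooled into a single embedding vector, and that each candidate object has one embedding of the same dimension. The function names, the cosine-similarity scoring, and the threshold are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def match_sentences_to_objects(object_emb, sentence_emb, threshold=0.5):
    """Score every (object, prompt) pair and keep pairs above a threshold.

    object_emb:   (N, D) array, one embedding per candidate object (assumed given).
    sentence_emb: (P, D) array, one pooled embedding per prompt (assumed given).
    Returns the (N, P) similarity matrix and a list of (object, prompt) index pairs.
    """
    sim = l2_normalize(object_emb) @ l2_normalize(sentence_emb).T
    rows, cols = np.where(sim > threshold)
    matches = [(int(i), int(j)) for i, j in zip(rows, cols)]
    return sim, matches


# Toy data: three candidate objects and two prompts in a 3-dimensional space.
objects = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.7, 0.7, 0.0]])
prompts = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])

sim, matches = match_sentences_to_objects(objects, prompts, threshold=0.6)
print(matches)  # [(0, 0), (2, 0)]
```

Because each prompt contributes one vector regardless of its length, the similarity matrix has one column per prompt rather than one per word. That is one plausible reading of why a sentence-level formulation could scale to many categories and descriptions.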

Findings

01. Outperforms state-of-the-art models on over 160 datasets.

02. Effectively scales to thousands of categories and region descriptions.

03. Achieves competitive results without task-specific fine-tuning.

Abstract

Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques