Aligning and Prompting Everything All at Once for Universal Visual Perception
Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji

TL;DR
APE introduces a universal visual perception model that aligns and prompts multiple vision tasks simultaneously, achieving state-of-the-art results across diverse datasets without task-specific fine-tuning.
Contribution
The paper proposes a novel instance-level sentence-object matching paradigm that unifies detection, segmentation, and grounding in a single model, advancing universal visual perception.
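To make the idea of instance-level sentence-object matching concrete, here is a minimal sketch, not the paper's implementation. It assumes each candidate object and each text prompt (a category name or a full region description) has already been encoded as one fixed-size vector, and it scores every object against every whole-sentence embedding by cosine similarity. The function names, the threshold, and the toy embeddings are all hypothetical; the actual model uses learned vision and language encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def match_objects_to_sentences(object_embs, sentence_embs, threshold=0.5):
    """For each object, list the (sentence_index, cosine_score) pairs
    whose score is at least `threshold`.

    Each prompt is represented by ONE embedding for the whole sentence,
    so the score matrix has shape (num_objects, num_sentences) instead of
    (num_objects, num_words) as in object-word alignment.
    """
    objs = [l2_normalize(o) for o in object_embs]
    sents = [l2_normalize(s) for s in sentence_embs]
    matches = []
    for o in objs:
        scores = [sum(a * b for a, b in zip(o, s)) for s in sents]
        matches.append([(j, sc) for j, sc in enumerate(scores) if sc >= threshold])
    return matches


# Toy, hand-made 3-d embeddings (hypothetical values, not model outputs).
prompts = ["cat", "the dog on the left", "sky"]
sentence_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
object_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

matches = match_objects_to_sentences(object_embs, sentence_embs)
for i, per_object in enumerate(matches):
    print(i, [(prompts[j], round(score, 3)) for j, score in per_object])
```

In this sketch a category name and a long region description are handled identically, since both are just prompts with one embedding each. This is one way to see how detection and grounding could share a single matching step.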
Findings
Outperforms or is on par with state-of-the-art models across more than 160 datasets.
Effectively scales to thousands of categories and region descriptions.
Achieves competitive results without task-specific fine-tuning.
Abstract
Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
