EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment
Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka

TL;DR
This paper introduces EVP, an advanced network architecture that enhances visual perception by refining features and aligning images with text, achieving state-of-the-art results in depth estimation and referring segmentation tasks.
Contribution
The paper proposes the IMAFR module and a novel image-text alignment method, significantly improving feature learning and extraction in the Stable Diffusion backbone.
Findings
State-of-the-art depth estimation on NYU Depth v2 and KITTI datasets.
Significant IoU improvement in referring segmentation on RefCOCO.
Enhanced feature learning capabilities demonstrated through comprehensive experiments.
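For readers unfamiliar with the IoU metric mentioned in the findings, the following is a minimal sketch of how intersection-over-union is typically computed between a predicted and a ground-truth binary mask. The function name `mask_iou` and the convention of returning 1.0 for two empty masks are assumptions made for illustration; this page does not show the paper's evaluation code.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of the same shape.

    Hypothetical helper for illustration; not taken from the EVP code base.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)

# Toy example: 2 overlapping pixels, 4 pixels in the union.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(mask_iou(pred, target))  # 0.5
```

Benchmarks such as RefCOCO usually aggregate this per-sample score over the whole test split; the exact aggregation used in the paper is not stated on this page.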
Abstract
This work presents the network architecture EVP (Enhanced Visual Perception). EVP builds on the previous work VPD, which paved the way for using the Stable Diffusion network for computer vision tasks. We propose two major enhancements. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances feature learning capabilities by aggregating spatial information from higher pyramid levels. Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone. The resulting architecture is suitable for a wide variety of tasks, and we demonstrate its performance in the context of single-image depth estimation with a specialized decoder using classification-based bins and referring segmentation with an off-the-shelf decoder. Comprehensive experiments conducted on established datasets show that EVP achieves…
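The abstract mentions a depth decoder "using classification-based bins" without giving details. As a rough illustration of the general idea, the sketch below turns per-pixel classification scores over depth bins into a continuous depth map by taking the probability-weighted average of the bin centers. This is a common formulation in bin-based depth estimators; whether EVP's decoder uses exactly this form is an assumption, and the names `depth_from_bins`, `logits`, and `bin_edges` are hypothetical.

```python
import numpy as np

def depth_from_bins(logits, bin_edges):
    """Convert per-pixel bin scores into a depth map.

    logits:    array of shape (H, W, K), one score per depth bin (assumed input).
    bin_edges: array of shape (K + 1,), increasing depth values bounding the bins.
    Returns an (H, W) array of expected depths.
    """
    logits = np.asarray(logits, dtype=np.float64)
    bin_edges = np.asarray(bin_edges, dtype=np.float64)
    # Represent each bin by its midpoint.
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    # Numerically stable softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Expected depth: sum_k p_k * c_k for every pixel.
    return probs @ centers

# Toy example: 5 bins covering 0 to 10 (arbitrary units), a 2x2 "image".
edges = np.linspace(0.0, 10.0, 6)      # bin centers are 1, 3, 5, 7, 9
uniform_logits = np.zeros((2, 2, 5))   # equal score for every bin
depth = depth_from_bins(uniform_logits, edges)
print(depth.shape)  # (2, 2)
```

Because the output is a convex combination of bin centers, the predicted depth always stays within the range spanned by the first and last center, which is one reason this formulation tends to give smooth, bounded predictions.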
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques
Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Transformer · Diffusion
