EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment

Mykola Lavreniuk; Shariq Farooq Bhat; Matthias Müller; Peter Wonka

arXiv:2312.08548·cs.CV·December 15, 2023·5 cites

TL;DR

This paper introduces EVP, an advanced network architecture that enhances visual perception by refining features and aligning images with text, achieving state-of-the-art results in depth estimation and referring segmentation tasks.

Contribution

The paper proposes the IMAFR module and a novel image-text alignment method, significantly improving feature learning and extraction in the Stable Diffusion backbone.

Findings

01 State-of-the-art depth estimation on NYU Depth v2 and KITTI datasets.

02 Significant IoU improvement in referring segmentation on RefCOCO.

03 Enhanced feature learning capabilities demonstrated through comprehensive experiments.

Abstract

This work presents the network architecture EVP (Enhanced Visual Perception). EVP builds on the previous work VPD which paved the way to use the Stable Diffusion network for computer vision tasks. We propose two major enhancements. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module which enhances feature learning capabilities by aggregating spatial information from higher pyramid levels. Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone. The resulting architecture is suitable for a wide variety of tasks and we demonstrate its performance in the context of single-image depth estimation with a specialized decoder using classification-based bins and referring segmentation with an off-the-shelf decoder. Comprehensive experiments conducted on established datasets show that EVP achieves…
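The abstract mentions a depth decoder "using classification-based bins" without giving details. As a rough illustration of that general idea only, and not the authors' decoder, the sketch below assumes uniform-width bins and a softmax-weighted average of bin centers for a single pixel. The function names, the bin count, and the 0 to 10 depth range are hypothetical choices made for the example.

```python
import math


def bin_centers(d_min, d_max, n_bins):
    """Centers of n_bins uniform-width depth bins covering [d_min, d_max]."""
    width = (d_max - d_min) / n_bins
    return [d_min + (i + 0.5) * width for i in range(n_bins)]


def softmax(logits):
    """Numerically stable softmax over a list of per-bin scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def depth_from_bins(logits, centers):
    """Expected depth for one pixel: probability-weighted sum of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


# Illustrative values only (assumed, not taken from the paper).
centers = bin_centers(0.0, 10.0, 5)
uniform_depth = depth_from_bins([0.0, 0.0, 0.0, 0.0, 0.0], centers)
peaked_depth = depth_from_bins([50.0, 0.0, 0.0, 0.0, 0.0], centers)
```

With equal scores for every bin the estimate falls at the middle of the range, and when one bin dominates the estimate approaches that bin's center. Treating depth as a distribution over bins, rather than regressing a single value directly, is the property the term "classification-based" refers to in this sketch.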

Code & Models

Repositories

lavreniuk/evp (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques

Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Transformer · Diffusion