Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction

Jing Zhang; Jianwen Xie; Nick Barnes; Ping Li

arXiv:2112.13528 · cs.CV · December 28, 2021 · 45 cites

TL;DR

This paper introduces a generative vision transformer with an energy-based latent prior for salient object detection. The model produces saliency predictions together with pixel-wise uncertainty maps, and is trained by MCMC-based maximum likelihood estimation with Langevin dynamics sampling.

Contribution

It proposes a novel generative vision transformer with an energy-based prior for improved saliency detection and uncertainty estimation.

Findings

01 Achieves accurate saliency predictions on RGB and RGB-D data.

02 Generates meaningful uncertainty maps aligned with human perception.

03 Outperforms existing models in saliency detection accuracy.

Abstract

Vision transformer networks have shown superiority in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer with latent variables following an informative energy-based prior for salient object detection. Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation, in which sampling from the intractable posterior and prior distributions of the latent variables is performed by Langevin dynamics. Further, with the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image. Unlike existing generative models, which define the prior distribution of the latent variables as a simple isotropic Gaussian distribution,…
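The abstract says that latent variables are sampled from the energy-based prior with Langevin dynamics. The following is a minimal sketch of that sampling step only, not the authors' implementation. It assumes a hypothetical toy quadratic energy U(z) = ||z - mu||^2 / (2 sigma^2) in place of the learned energy network, so that the target distribution is a known Gaussian and the result can be checked. The names `toy_energy_grad` and `langevin_sample`, the step size, and the number of steps are all illustrative assumptions.

```python
import numpy as np


def toy_energy_grad(z, mu=2.0, sigma=0.5):
    """Gradient of the toy energy U(z) = ||z - mu||^2 / (2 * sigma^2).

    Assumption: a stand-in for the gradient of a learned energy network.
    """
    return (z - mu) / sigma**2


def langevin_sample(grad_energy, z0, n_steps=200, step_size=0.1, rng=None):
    """Run unadjusted Langevin dynamics targeting p(z) proportional to exp(-U(z)).

    Each step applies: z <- z - (s^2 / 2) * grad U(z) + s * noise,
    where s is the step size and noise is standard Gaussian.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        z = z - 0.5 * step_size**2 * grad_energy(z) + step_size * noise
    return z


# Start 5000 independent chains of an 8-dimensional latent from N(0, I).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((5000, 8))
samples = langevin_sample(toy_energy_grad, z0, rng=rng)
```

With this toy energy the chains should drift from the standard Gaussian initialization toward a Gaussian centered near `mu` with spread near `sigma`. In the setting the abstract describes, the gradient would instead come from the learned energy-based prior, typically via automatic differentiation, and a similar loop would be used for the posterior.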

Videos

Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction · slideslive

Taxonomy

Topics: Visual Attention and Saliency Detection · Face Recognition and Perception · Aesthetic Perception and Analysis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Residual Connection · Dense Connections · Vision Transformer