Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction
Jing Zhang, Jianwen Xie, Nick Barnes, Ping Li

TL;DR
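This paper introduces a generative vision transformer whose latent variables follow an energy-based prior for salient object detection. The model is trained by MCMC-based maximum likelihood estimation with Langevin dynamics, and it yields accurate saliency predictions together with pixel-wise uncertainty maps.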
Contribution
It proposes a novel generative vision transformer with an energy-based prior for improved saliency detection and uncertainty estimation.
Findings
Achieves accurate saliency predictions on RGB and RGB-D data.
Generates meaningful uncertainty maps aligned with human perception.
Outperforms existing models in saliency detection accuracy.
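As a rough illustration of how a pixel-wise uncertainty map can be derived from a generative saliency model, the sketch below averages several stochastic predictions for one image and takes the per-pixel binary entropy of the mean. The entropy measure, the function name `uncertainty_map`, and the synthetic predictions are assumptions made for illustration; this summary does not state which uncertainty measure the authors use.

```python
import numpy as np

def uncertainty_map(pred_samples, eps=1e-8):
    """Return (mean prediction, per-pixel entropy) from sampled saliency maps.

    pred_samples: array of shape (S, H, W) with values in [0, 1], one map per
    latent sample. Binary entropy of the mean is an assumed uncertainty measure.
    """
    mean_pred = pred_samples.mean(axis=0)
    p = np.clip(mean_pred, eps, 1.0 - eps)  # avoid log(0)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return mean_pred, entropy

# Synthetic stand-in for 8 sampled predictions on a 4x4 image (hypothetical data).
samples = np.full((8, 4, 4), 0.98)   # left half: samples agree, salient
samples[0::2, :, 2:] = 0.1           # right half: samples disagree
samples[1::2, :, 2:] = 0.9

mean_pred, unc = uncertainty_map(samples)
# Pixels where the samples disagree receive higher entropy than pixels where they agree.
print(unc.round(3))
```

In this toy case the right half of the map, where sampled predictions conflict, gets the highest uncertainty, which is the qualitative behaviour the findings describe.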
Abstract
Vision transformer networks have shown superiority in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer with latent variables following an informative energy-based prior for salient object detection. Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation, in which sampling from the intractable posterior and prior distributions of the latent variables is performed by Langevin dynamics. Further, with the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image. Different from the existing generative models which define the prior distribution of the latent variables as a simple isotropic Gaussian distribution,…
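The Langevin dynamics step mentioned in the abstract can be sketched on a toy problem. The example below draws latent samples from a prior of the form p(z) ∝ exp(-U(z)) N(z; 0, I) using unadjusted Langevin updates. The quadratic energy U(z) = 0.5 ||z - mu||^2, the step size, the number of steps, and all function names are assumptions chosen so that the target is a known Gaussian; in the paper the energy is a learned network and its gradient would come from automatic differentiation.

```python
import numpy as np

def grad_neg_log_prior(z, mu):
    """Gradient of U(z) + 0.5 * ||z||^2 for the toy energy U(z) = 0.5 * ||z - mu||^2."""
    return (z - mu) + z

def langevin_sample(z0, mu, n_steps=200, step_size=0.05, rng=None):
    """Run unadjusted Langevin dynamics starting from z0 (shape: chains x dim)."""
    rng = np.random.default_rng() if rng is None else rng
    z = z0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        # Gradient step on the negative log-density plus injected Gaussian noise.
        z = z - step_size * grad_neg_log_prior(z, mu) + np.sqrt(2.0 * step_size) * noise
    return z

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])            # hypothetical energy parameter
z0 = rng.standard_normal((2000, 2))   # chains initialised from N(0, I)
samples = langevin_sample(z0, mu, rng=rng)

# For this toy energy the target is Gaussian with mean mu / 2,
# so the sample mean should land close to that value.
print(samples.mean(axis=0))
```

The same update rule applies to posterior sampling; only the gradient changes, since it then also includes the likelihood term from the saliency prediction.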
Taxonomy
Topics: Visual Attention and Saliency Detection · Face Recognition and Perception · Aesthetic Perception and Analysis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Residual Connection · Dense Connections · Vision Transformer
