Variational Structured Attention Networks for Deep Visual Representation Learning
Guanglei Yang, Paolo Rota, Xavier Alameda-Pineda, Dan Xu, Mingli Ding, Elisa Ricci

TL;DR
This paper introduces VISTA-Net, a unified probabilistic framework that jointly learns structured spatial and channel attention to refine deep visual representations, improving performance on dense prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction.
Contribution
It proposes a novel end-to-end trainable model that structures and models interactions between spatial and channel attentions within a probabilistic framework.
Findings
Outperforms state-of-the-art on six large-scale datasets
Effective joint learning of spatial and channel attentions
Improves accuracy in dense visual prediction tasks
Abstract
Convolutional neural networks have enabled major progress in addressing pixel-level prediction tasks such as semantic segmentation, depth estimation, surface normal prediction and so on, benefiting from their powerful capabilities in visual representation learning. Typically, state-of-the-art models integrate attention mechanisms for improved deep feature representations. Recently, some works have demonstrated the significance of learning and combining both spatial- and channel-wise attentions for deep feature refinement. In this paper, we aim to effectively boost previous approaches and propose a unified deep framework to jointly learn both spatial attention maps and channel attention vectors in a principled manner so as to structure the resulting attention tensors and model interactions between these two types of attentions. Specifically, we integrate the estimation and the…
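To make the idea of a structured attention tensor concrete, the sketch below combines one spatial attention map with one channel attention vector into a single tensor and applies it to a feature map. It is a deterministic toy illustration only, not the paper's probabilistic formulation: the function name `structured_attention`, the sigmoid gating, and the outer-product combination are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def structured_attention(features, spatial_logits, channel_logits):
    """Toy structured attention (hypothetical helper, not the VISTA-Net model).

    features:       array of shape (C, H, W)
    spatial_logits: array of shape (H, W), one score per pixel
    channel_logits: array of shape (C,), one score per channel
    """
    spatial_map = sigmoid(spatial_logits)   # (H, W), values in (0, 1)
    channel_vec = sigmoid(channel_logits)   # (C,), values in (0, 1)
    # Assumption: the attention tensor is the outer product of the channel
    # vector and the spatial map, so it has shape (C, H, W) and rank 1
    # when flattened to a (C, H*W) matrix.
    attention = channel_vec[:, None, None] * spatial_map[None, :, :]
    return attention * features, attention


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 6))
refined, attention = structured_attention(
    features,
    spatial_logits=rng.standard_normal((5, 6)),
    channel_logits=rng.standard_normal(4),
)
print(refined.shape, attention.shape)
```

In this sketch, the structure comes from the factorization: the full attention tensor is never predicted element by element, only a spatial map and a channel vector whose product couples the two.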
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
