Recurrent Mixture Density Network for Spatiotemporal Visual Attention
Loris Bazzani, Hugo Larochelle, Lorenzo Torresani

TL;DR
This paper introduces a spatiotemporal attention model for videos that learns where to focus based on human fixation data, combining Gaussian mixtures, deep features, and LSTMs for improved saliency prediction and action recognition.
Contribution
It presents a novel hierarchical model that integrates Gaussian mixture-based saliency, deep 3D features, and LSTMs, trained directly on human fixations for enhanced video attention and action classification.
Findings
Achieves state-of-the-art saliency prediction on Hollywood2.
Generalizes well to UCF101 for saliency and action recognition.
Improves action classification accuracy using learned attention.
Abstract
In many computer vision tasks, the information relevant to solving the problem at hand is mixed with irrelevant, distracting information. This has motivated researchers to design attentional models that can dynamically focus on parts of images or videos that are salient, e.g., by down-weighting irrelevant pixels. In this work, we propose a spatiotemporal attentional model that learns where to look in a video directly from human fixation data. We model visual attention with a mixture of Gaussians at each frame. This distribution is used to express the probability of saliency for each pixel. Time consistency in videos is modeled hierarchically by: 1) deep 3D convolutional features to represent spatial and short-term time relations and 2) a long short-term memory network on top that aggregates the clip-level representation of sequential clips and therefore expands the temporal domain from few…
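To make the per-frame mixture of Gaussians concrete, here is a minimal sketch in plain Python of how such a mixture could be turned into a per-pixel saliency map, and how well it explains a set of fixation points. It is an illustration under stated assumptions, not the authors' implementation: the diagonal covariances, the normalisation over the pixel grid, the negative log-likelihood score, and all names and parameter values (EXAMPLE_WEIGHTS, EXAMPLE_MEANS, EXAMPLE_STDS, fixation_nll) are hypothetical. In the model described above the mixture parameters would be produced by the network for each frame; here they are set by hand.

```python
import math

def gaussian_2d(x, y, mean, std):
    """Density of an axis-aligned (diagonal-covariance) 2D Gaussian at (x, y)."""
    (mx, my), (sx, sy) = mean, std
    z = ((x - mx) / sx) ** 2 + ((y - my) / sy) ** 2
    return math.exp(-0.5 * z) / (2.0 * math.pi * sx * sy)

def mixture_density(x, y, weights, means, stds):
    """Weighted sum of the component densities at (x, y)."""
    return sum(w * gaussian_2d(x, y, m, s)
               for w, m, s in zip(weights, means, stds))

def saliency_map(height, width, weights, means, stds):
    """Evaluate the mixture at every pixel and normalise the map to sum to 1."""
    raw = [[mixture_density(col, row, weights, means, stds)
            for col in range(width)]
           for row in range(height)]
    total = sum(sum(r) for r in raw)
    return [[v / total for v in r] for r in raw]

def fixation_nll(fixations, weights, means, stds):
    """Average negative log-likelihood of (x, y) fixations under the mixture.

    Assumed here as a plausible training signal; lower means the mixture
    explains the fixations better.
    """
    logs = [math.log(mixture_density(x, y, weights, means, stds))
            for x, y in fixations]
    return -sum(logs) / len(logs)

# Hypothetical parameters for one frame: two components, weights sum to 1.
EXAMPLE_WEIGHTS = [0.7, 0.3]
EXAMPLE_MEANS = [(8.0, 6.0), (20.0, 12.0)]   # (x, y) in pixels
EXAMPLE_STDS = [(2.0, 2.0), (3.0, 3.0)]      # per-axis standard deviations

example_map = saliency_map(16, 32, EXAMPLE_WEIGHTS, EXAMPLE_MEANS, EXAMPLE_STDS)
```

The resulting map can be read as a per-pixel saliency probability for the frame. Fixations that fall near the component means give a lower fixation_nll than fixations far from them, which is the sense in which a mixture of this kind can be fitted to human fixation data.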
Taxonomy
Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
