Recurrent Mixture Density Network for Spatiotemporal Visual Attention

Loris Bazzani; Hugo Larochelle; Lorenzo Torresani

arXiv:1603.08199·cs.CV·February 14, 2017·115 cites

TL;DR

This paper introduces a spatiotemporal attention model for videos that learns where to focus based on human fixation data, combining Gaussian mixtures, deep features, and LSTMs for improved saliency prediction and action recognition.

Contribution

It presents a novel hierarchical model that integrates Gaussian mixture-based saliency, deep 3D features, and LSTMs, trained directly on human fixations for enhanced video attention and action classification.

Findings

01. Achieves state-of-the-art saliency prediction on Hollywood2.

02. Generalizes well to UCF101 for saliency and action recognition.

03. Improves action classification accuracy using learned attention.

Abstract

In many computer vision tasks, the information relevant to the problem at hand is mixed with irrelevant, distracting information. This has motivated researchers to design attentional models that can dynamically focus on the salient parts of images or videos, e.g., by down-weighting irrelevant pixels. In this work, we propose a spatiotemporal attentional model that learns where to look in a video directly from human fixation data. We model visual attention with a mixture of Gaussians at each frame. This distribution is used to express the probability of saliency for each pixel. Time consistency in videos is modeled hierarchically by: 1) deep 3D convolutional features to represent spatial and short-term time relations and 2) a long short-term memory network on top that aggregates the clip-level representation of sequential clips and therefore expands the temporal domain from few…
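As an illustration of how a per-frame mixture of Gaussians can express a saliency probability for each pixel, the following is a minimal sketch, not the authors' implementation. It assumes diagonal covariances and takes the mixture parameters as plain arrays; in the model described above they would be predicted by the network. The function name `gmm_saliency_map` and all parameter values are hypothetical.

```python
import numpy as np


def gmm_saliency_map(height, width, weights, means, variances):
    """Evaluate a 2D Gaussian mixture on a pixel grid.

    Assumptions (not taken from the paper): diagonal covariances,
    means given as (x, y) pixel coordinates, weights summing to 1.

    weights:   shape (K,)
    means:     shape (K, 2)
    variances: shape (K, 2), per-axis variances
    Returns an array of shape (height, width) that sums to 1.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Pixel coordinates as an (H, W, 2) array of (x, y) pairs.
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(float)

    # Offset of every pixel from every component mean: (H, W, K, 2).
    diff = grid[:, :, None, :] - means[None, None, :, :]

    # Log-kernel of each diagonal Gaussian: (H, W, K).
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=-1)

    # Normalizing constant of each 2D diagonal Gaussian: (K,).
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.prod(variances, axis=-1)))

    # Weighted sum over components gives the mixture density per pixel.
    density = np.sum(weights * norm * np.exp(exponent), axis=-1)

    # Renormalize over the discrete grid so the map is a distribution.
    return density / density.sum()


# Hypothetical two-component mixture on a 36x64 frame.
saliency = gmm_saliency_map(
    36, 64,
    weights=[0.7, 0.3],
    means=[[16, 18], [48, 10]],
    variances=[[16, 16], [9, 9]],
)
```

The final renormalization is a design choice for this sketch: it makes the discrete map sum to one over the frame, which is convenient when comparing it against a normalized fixation map.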

Taxonomy

Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques