TL;DR
This paper introduces a multi-modal saliency prediction model for multiple-face videos that incorporates visual, audio, and face information, demonstrating improved accuracy over existing methods and aligning more closely with human attention patterns.
Contribution
The paper presents a novel multi-modal saliency model that integrates visual, audio, and face cues, supported by a large-scale eye-tracking database and outperforming existing methods.
Findings
Outperforms 11 state-of-the-art saliency models
Aligns closely with human multi-modal attention
Validates the influence of audio on visual saliency
Abstract
Video streams now account for a large proportion of Internet traffic, and most of them contain human faces. Hence, it is necessary to predict saliency in multiple-face videos, which can provide attention cues for many content-based applications. However, most work on multiple-face saliency prediction considers only visual information and ignores audio, which is inconsistent with naturalistic scenarios. Several behavioral studies have established that sound influences human attention, especially during speech turn-taking in multiple-face videos. In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face. The visual…
