# AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

**Authors:** Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru

arXiv: 1901.01342 · 2019-05-28

## TL;DR

This paper introduces AVA-ActiveSpeaker, a large, human-labeled audio-visual dataset for active speaker detection, and presents a new audio-visual detection approach whose analysis illustrates both the method's strength and the dataset's value.

## Contribution

The paper contributes a large-scale dataset for active speaker detection, to be released publicly, and proposes a new audio-visual method, so that algorithms can be developed and compared on common data.

## Key findings

- The dataset contains about 3.65 million human-labeled frames, or about 38.5 hours of face tracks, with the corresponding audio.
- The paper presents a new audio-visual approach for active speaker detection and analyzes its performance, which the authors report as demonstrating its strength.
- The same analysis is used to demonstrate the contributions of the dataset.
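
As a rough sanity check on the two size figures above, the short sketch below computes the average number of labeled frames per second that they imply. Both inputs are approximate in the source, and the result is only an average derived from those two numbers, not a frame rate stated in this summary. The variable names are chosen for illustration.

```python
# Figures quoted in this summary; both are stated as approximate.
labeled_frames = 3.65e6   # about 3.65 million human-labeled frames
hours = 38.5              # about 38.5 hours of face tracks

# Convert hours to seconds, then divide to get the implied average rate.
seconds = hours * 3600
implied_fps = labeled_frames / seconds

print(f"{implied_fps:.1f} labeled frames per second of face track")
# prints "26.3 labeled frames per second of face track"
```

The result is in the range of common video frame rates, so the two figures are consistent with labels being given roughly once per video frame.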

## Abstract

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human labeled frames or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.
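
The labeling scheme described in the abstract (a face track labeled over time, with each face instance marked as speaking or not, and whether the speech is audible) can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the class names, the field names, the label spellings, and the `speaking_segments` helper are hypothetical and are not taken from the paper or from any released file format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class SpeechLabel(Enum):
    # Hypothetical label names covering the cases the abstract describes.
    NOT_SPEAKING = "not_speaking"
    SPEAKING_AUDIBLE = "speaking_audible"
    SPEAKING_NOT_AUDIBLE = "speaking_not_audible"


@dataclass
class FaceInstance:
    timestamp_s: float   # time of this face instance within the video
    label: SpeechLabel   # per-instance speaking label


def speaking_segments(track: List[FaceInstance]) -> List[Tuple[float, float]]:
    """Merge consecutive speaking instances into (start, end) timestamp pairs."""
    segments: List[Tuple[float, float]] = []
    start: Optional[float] = None
    prev_t: Optional[float] = None
    for inst in track:
        speaking = inst.label is not SpeechLabel.NOT_SPEAKING
        if speaking and start is None:
            start = inst.timestamp_s          # a speaking run begins
        elif not speaking and start is not None:
            segments.append((start, prev_t))  # run ended at the previous instance
            start = None
        prev_t = inst.timestamp_s
    if start is not None:
        segments.append((start, prev_t))      # run lasted until the track ended
    return segments


# Usage: a five-instance face track with two speaking runs.
track = [
    FaceInstance(0.00, SpeechLabel.NOT_SPEAKING),
    FaceInstance(0.04, SpeechLabel.SPEAKING_AUDIBLE),
    FaceInstance(0.08, SpeechLabel.SPEAKING_AUDIBLE),
    FaceInstance(0.12, SpeechLabel.NOT_SPEAKING),
    FaceInstance(0.16, SpeechLabel.SPEAKING_NOT_AUDIBLE),
]
print(speaking_segments(track))  # prints [(0.04, 0.08), (0.16, 0.16)]
```

Treating both speaking labels as "speaking" is a design choice of this sketch; a task that depends on the audio would likely want to keep audible and non-audible speech separate.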

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.01342/full.md

## Figures

45 figures with captions in the complete paper: https://tomesphere.com/paper/1901.01342/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/1901.01342/full.md

---
Source: https://tomesphere.com/paper/1901.01342