Learning Spatial-Temporal Graphs for Active Speaker Detection

Sourya Roy; Kyle Min; Subarna Tripathi; Tanaya Guha; Somdeb Majumdar

arXiv:2112.01479·cs.CV·December 7, 2021


TL;DR

This paper introduces SPELL, a graph-based framework for active speaker detection that models long-range multimodal relationships, improving accuracy and efficiency over existing methods.

Contribution

SPELL is a novel framework that learns spatial-temporal graphs for active speaker detection, capturing long-term dependencies and inter-modal relationships.

Findings

01. Outperforms relevant baselines in accuracy.

02. Achieves comparable performance to state-of-the-art models.

03. Requires significantly less computation.

Abstract

We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations, owing to their explicit spatial and temporal structure, significantly improves the overall performance. SPELL outperforms several relevant baselines and performs on par with state-of-the-art models…
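
The graph construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `FaceNode`, `build_edges`, and `window` are hypothetical, edges are treated as undirected, and the temporal window is measured in frame indices purely for illustration. It only shows the two connection rules stated above: same identity within a temporal window, and same video frame.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class FaceNode:
    """One node per person per frame (hypothetical structure)."""
    frame: int      # index of the video frame the person appears in
    identity: int   # assumed track id identifying the person


def build_edges(nodes, window):
    """Return undirected edges (i, j) with i < j over node indices.

    Two nodes are connected if they share an identity and lie within
    `window` frames of each other, or if they appear in the same frame.
    """
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(nodes), 2):
        same_person_nearby = (
            a.identity == b.identity and abs(a.frame - b.frame) <= window
        )
        same_frame = a.frame == b.frame
        if same_person_nearby or same_frame:
            edges.add((i, j))
    return edges


# Toy example: two people in frame 0, then person 0 again in frames 1 and 5.
nodes = [
    FaceNode(frame=0, identity=0),
    FaceNode(frame=0, identity=1),
    FaceNode(frame=1, identity=0),
    FaceNode(frame=5, identity=0),
]
edges = build_edges(nodes, window=2)
print(sorted(edges))  # [(0, 1), (0, 2)]
```

In this toy input, nodes 0 and 1 are linked because they share a frame, and nodes 0 and 2 are linked because they are the same person one frame apart. Node 3 stays isolated because it falls outside the assumed window of 2 frames. A node classifier would then label each node as speaking or not speaking.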

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Attentive Walk-Aggregating Graph Neural Network