Learning Spatial-Temporal Graphs for Active Speaker Detection
Sourya Roy, Kyle Min, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

TL;DR
This paper introduces SPELL, a graph-based framework for active speaker detection that models long-range multimodal relationships, improving accuracy and efficiency over existing methods.
Contribution
SPELL is a novel framework that learns spatial-temporal graphs for active speaker detection, capturing long-term dependencies and inter-modal relationships.
Findings
Outperforms relevant baselines in accuracy.
Achieves comparable performance to state-of-the-art models.
Requires significantly less computation.
Abstract
We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations, owing to their explicit spatial and temporal structure, significantly improves the overall performance. SPELL outperforms several relevant baselines and performs on par with state-of-the-art models…
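The graph construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the detection format `(frame index, identity label)`, the function name `build_edges`, and the `window` parameter are all assumptions made for illustration, and the audio-visual node features and the graph neural network that classifies each node are left out.

```python
from itertools import combinations
from typing import List, Set, Tuple

# Hypothetical detection format: (frame index, identity label).
Detection = Tuple[int, str]


def build_edges(detections: List[Detection], window: int) -> Set[Tuple[int, int]]:
    """Return undirected edges (i, j), i < j, over detection indices.

    Two nodes are connected if they are in the same frame (inter-person
    edge) or share an identity and lie within `window` frames of each
    other (temporal edge). `window` is an assumed parameter name.
    """
    edges = set()
    for i, j in combinations(range(len(detections)), 2):
        frame_i, id_i = detections[i]
        frame_j, id_j = detections[j]
        same_frame = frame_i == frame_j
        same_identity_nearby = id_i == id_j and abs(frame_i - frame_j) <= window
        if same_frame or same_identity_nearby:
            edges.add((i, j))
    return edges


# Toy example: persons A and B in frames 0 and 1, A again in frame 5.
detections = [(0, "A"), (0, "B"), (1, "A"), (1, "B"), (5, "A")]
edges = build_edges(detections, window=2)
print(sorted(edges))
```

In this toy input, the detection of A at frame 5 stays isolated because it lies outside the two-frame window of A's earlier detections and shares its frame with no one. The pairwise loop is quadratic in the number of detections, which is fine for a sketch but would need a sorted or bucketed approach for long videos.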
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Attentive Walk-Aggregating Graph Neural Network
