Audio-Visual Talker Localization in Video for Spatial Sound Reproduction
Davide Berghi, Philip J. B. Jackson

TL;DR
This paper presents a novel audio-visual method for localizing active speakers in videos, combining multichannel audio and visual cues to improve spatial accuracy and detection reliability in media production.
Contribution
It introduces an integrated audio-visual approach that leverages multichannel audio and visual data for active speaker detection, outperforming previous single-channel and audio-only methods.
Findings
Multichannel audio reduces detection error by double digits compared to single-channel methods.
Combining multichannel audio with visual data increases F1 score by four percentage points.
Multichannel audio overcomes visual occlusion issues, enhancing localization accuracy.
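The F1 finding above is stated in percentage points, which is an absolute difference between two scores rather than a relative change. The minimal Python sketch below shows how F1 is computed from detection counts and how a percentage-point gain is derived. The function names and the counts in the usage lines are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def percentage_point_gain(f1_before: float, f1_after: float) -> float:
    """Absolute difference between two F1 scores, in percentage points."""
    return (f1_after - f1_before) * 100.0


# Hypothetical counts (NOT from the paper), used only to illustrate the arithmetic.
baseline = f1_score(tp=80, fp=20, fn=20)
improved = f1_score(tp=84, fp=16, fn=16)
gain = percentage_point_gain(baseline, improved)
```

With these made-up counts the gain comes out at about four percentage points. The relative improvement over the baseline would be a different number, which is why the unit matters when comparing systems.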
Abstract
Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating the automatic extraction of the positional metadata of the talker relative to the camera's reference frame. With the integration of the visual modality, this study expands upon our previous investigation focused solely on audio-based active speaker detection and localization. Our experiments compare conventional audio-visual approaches for active speaker detection that leverage monaural audio, our previous audio-only method that leverages multichannel recordings from a microphone array, and a…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing
