Kernel-based Sensor Fusion with Application to Audio-Visual Voice Activity Detection
David Dov, Ronen Talmon, Israel Cohen

TL;DR
This paper analyzes kernel-based fusion of multi-view data, shows that the kernel bandwidth governs the robustness of the fusion to noise and interferences, and applies the approach to audio-visual voice activity detection, where it is reported to outperform existing methods.
Contribution
The paper proposes an algorithm for selecting the kernel bandwidth in product-of-kernels sensor fusion, based on a statistical model of the connectivity between data points, to improve robustness in noisy audio-visual voice activity detection.
Findings
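The choice of kernel bandwidth has important implications for the robustness of the fusion to noise and interferences.
The proposed method outperforms existing voice activity detection approaches.
Selecting the bandwidth with the proposed algorithm, rather than leaving it untuned, improves fusion performance.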
Abstract
In this paper, we address the problem of multiple view data fusion in the presence of noise and interferences. Recent studies have approached this problem using kernel methods, by relying particularly on a product of kernels constructed separately for each view. From a graph theory point of view, we analyze this fusion approach in a discrete setting. More specifically, based on a statistical model for the connectivity between data points, we propose an algorithm for the selection of the kernel bandwidth, a parameter, which, as we show, has important implications on the robustness of this fusion approach to interferences. Then, we consider the fusion of audio-visual speech signals measured by a single microphone and by a video camera pointed to the face of the speaker. Specifically, we address the task of voice activity detection, i.e., the detection of speech and non-speech segments, in…
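The product-of-kernels fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes Gaussian kernels and row normalization, which are common choices in this family of methods but are not confirmed by the text above. The function names (`gaussian_kernel`, `median_bandwidth`, `fuse_views`) are hypothetical, and the median heuristic is only a stand-in for the paper's bandwidth selection algorithm, which is not reproduced here.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X (n_samples x n_features)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip small negative values caused by floating-point round-off.
    return np.maximum(sq_dists, 0.0)


def gaussian_kernel(X, bandwidth):
    """Affinity matrix K[i, j] = exp(-||x_i - x_j||^2 / bandwidth) (assumed kernel form)."""
    return np.exp(-pairwise_sq_dists(X) / bandwidth)


def median_bandwidth(X):
    """Median of off-diagonal squared distances.

    A common heuristic used here as a placeholder; it is NOT the
    bandwidth selection algorithm proposed in the paper.
    """
    sq_dists = pairwise_sq_dists(X)
    off_diag = sq_dists[~np.eye(len(X), dtype=bool)]
    return float(np.median(off_diag))


def fuse_views(X_audio, X_video, bw_audio, bw_video):
    """Element-wise product of per-view kernels, normalized so each row sums to 1."""
    K = gaussian_kernel(X_audio, bw_audio) * gaussian_kernel(X_video, bw_video)
    return K / K.sum(axis=1, keepdims=True)


# Toy usage with random features standing in for aligned audio and video frames.
rng = np.random.default_rng(0)
audio_features = rng.normal(size=(20, 5))
video_features = rng.normal(size=(20, 3))
fused = fuse_views(
    audio_features,
    video_features,
    median_bandwidth(audio_features),
    median_bandwidth(video_features),
)
```

Because the fused affinity is a product, two frames stay strongly connected only when they are close in both views. This is why the bandwidth of each kernel matters: a bandwidth that is too large lets an interference present in one view pass through the product largely unattenuated.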
