Egocentric Audio-Visual Noise Suppression
Roshan Sharma, Weipeng He, Ju Lin, Egor Lakomkin, Yang Liu, and Kaustubh Kalgaonkar

TL;DR
This paper introduces an egocentric audio-visual noise suppression method for videos in which the speaker is off-screen. Instead of lip or facial visuals, it draws visual cues from the objects and actions visible in the scene, and it uses multi-task learning to improve noise reduction.
Contribution
It proposes a new framework that uses visual features for noise suppression in egocentric videos, including multi-task learning for better noise and event discrimination.
Findings
Visual features improve noise suppression performance.
Multi-task learning enhances noise reduction and acoustic event detection.
The model outperforms audio-only baselines across various conditions.
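The multi-task idea in the findings above can be illustrated with a small sketch. This summary does not give the paper's loss functions or weighting, so everything here is an assumption made for illustration: the function name `multitask_loss`, the mean-squared-error enhancement term, the softmax cross-entropy event term, and the `aux_weight` value are all hypothetical.

```python
import numpy as np

def multitask_loss(enhanced, clean, event_logits, event_label, aux_weight=0.1):
    """Hypothetical joint objective: enhancement loss plus a weighted
    acoustic-event classification loss (not the paper's actual loss)."""
    # Enhancement term: mean squared error between enhanced and clean features.
    enh_loss = np.mean((enhanced - clean) ** 2)

    # Auxiliary term: softmax cross-entropy over event classes,
    # computed with the usual max-shift for numerical stability.
    shifted = event_logits - np.max(event_logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cls_loss = -log_probs[event_label]

    # aux_weight (an assumed hyperparameter) trades off the two tasks.
    return enh_loss + aux_weight * cls_loss
```

In a setup like this, the auxiliary term would push shared representations to distinguish noise and event types, which is one plausible reading of why multi-task training helps both tasks.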
Abstract
This paper studies audio-visual noise suppression for egocentric videos -- where the speaker is not captured in the video. Instead, potential noise sources are visible on screen with the camera emulating the off-screen speaker's view of the outside world. This setting is different from prior work in audio-visual speech enhancement that relies on lip and facial visuals. In this paper, we first demonstrate that egocentric visual information is helpful for noise suppression. We compare object recognition and action classification-based visual feature extractors and investigate methods to align audio and visual representations. Then, we examine different fusion strategies for the aligned features, and locations within the noise suppression model to incorporate visual information. Experiments demonstrate that visual features are most helpful when used to generate additive correction masks.…
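The abstract's two central ideas, aligning visual features to the audio frame rate and using them to produce an additive correction mask, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not specify the alignment method, the mask parameterization, or the network, so the nearest-frame alignment, the `tanh`-bounded correction, the clipping to [0, 1], and the projection matrix `w_visual` are all assumptions.

```python
import numpy as np

def align_visual_to_audio(visual_feats, num_audio_frames):
    """Assumed alignment: repeat each visual frame so the visual sequence
    (T_v, D) matches the number of audio frames (nearest-frame upsampling)."""
    num_visual_frames = visual_feats.shape[0]
    idx = np.floor(
        np.arange(num_audio_frames) * num_visual_frames / num_audio_frames
    ).astype(int)
    return visual_feats[idx]

def apply_corrected_mask(noisy_mag, audio_mask, visual_feats, w_visual):
    """Add a visually derived correction to an audio-only mask.

    noisy_mag:    (T, F) noisy magnitude spectrogram
    audio_mask:   (T, F) mask in [0, 1] from an audio-only model
    visual_feats: (T, D) visual features already aligned to audio frames
    w_visual:     (D, F) hypothetical learned projection
    """
    # Correction bounded to (-1, 1) so it can raise or lower the mask.
    correction = np.tanh(visual_feats @ w_visual)
    # Additive fusion, clipped back to a valid mask range.
    mask = np.clip(audio_mask + correction, 0.0, 1.0)
    return mask * noisy_mag, mask
```

With `w_visual` set to zeros the correction vanishes and the output reduces to the audio-only mask, which makes the visual branch a pure refinement on top of the audio model in this sketch.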
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques
Methods: ALIGN
