Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

TL;DR
This paper introduces Audiovisual SlowFast Networks (AVSlowFast), a unified architecture for video recognition that fuses audio and visual features at multiple layers and reports state-of-the-art results on six video action classification and detection datasets.
Contribution
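It proposes an integrated audiovisual network with hierarchical (multi-layer) fusion, together with DropPathway, a regularization technique that randomly drops the Audio pathway during training to counter the different learning dynamics of the audio and visual modalities.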
Findings
Achieves state-of-the-art results on six video datasets.
Demonstrates effective audiovisual feature learning in self-supervised settings.
Shows the generalization of the model to various video recognition tasks.
Abstract
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the…
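The DropPathway idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `drop_prob` parameter, and the use of an element-wise sum as the fusion operation are assumptions made for illustration, and features are plain Python lists rather than tensors. The sketch only shows the control flow: one random decision per training iteration determines whether audio contributes at every fusion level or at none.

```python
import random


def drop_audio_this_step(drop_prob, training, rng):
    """Decide once per training iteration whether to drop the Audio pathway.

    drop_prob is a hypothetical hyperparameter name; the pathway is never
    dropped at inference time (training=False).
    """
    return training and rng.random() < drop_prob


def fuse_levels(visual_levels, audio_levels, drop_audio):
    """Fuse audio into visual features at every level of the hierarchy.

    Fusion here is an element-wise sum (an assumption for illustration).
    When the Audio pathway is dropped, visual features pass through unchanged.
    """
    fused = []
    for visual, audio in zip(visual_levels, audio_levels):
        if drop_audio:
            fused.append(list(visual))
        else:
            fused.append([v + a for v, a in zip(visual, audio)])
    return fused


if __name__ == "__main__":
    rng = random.Random(0)
    visual = [[1.0, 2.0], [3.0, 4.0]]   # two fusion levels of toy visual features
    audio = [[0.5, 0.5], [0.1, 0.1]]    # matching toy audio features
    for step in range(3):
        dropped = drop_audio_this_step(drop_prob=0.5, training=True, rng=rng)
        print(step, dropped, fuse_levels(visual, audio, dropped))
```

Making the drop decision once per iteration, rather than independently per layer, keeps the hierarchy consistent: in any given step the visual pathways either see audio at all fusion points or train as a visual-only network.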
Taxonomy
Topics: Music and Audio Processing · Human Pose and Action Recognition · Speech and Audio Processing
Methods: Audiovisual SlowFast Network · DropPathway
