Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Saghir Alfasly, Jian Lu, Chen Xu, Yuru Zou

TL;DR
This paper introduces a novel framework for multimodal action recognition that effectively leverages audio in vision-specific datasets by using semantic label mapping and a learnable dropout mechanism, improving performance on standard benchmarks.
Contribution
It proposes a new semantic audio-video label dictionary and a learnable irrelevant modality dropout to enhance audio utilization in vision-only datasets for action recognition.
Findings
Outperforms existing methods on Kinetics400 and UCF-101 datasets.
Effectively filters irrelevant audio modalities during training.
Enhances visual modality modeling with a new two-stream Transformer.
Abstract
Current multimodal methods apply modality fusion or cross-modality attention under the assumption that a video dataset is multimodally annotated, i.e., that both the auditory and visual modalities are labeled or class-relevant. However, effectively leveraging the audio modality in vision-specific annotated videos for action recognition is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-specific annotated dataset. We adopt language models (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels, so that SAVLD serves as a bridge between audio and video datasets. Then, SAVLD, along with a pretrained audio multi-label model, is used to estimate the audio-visual modality relevance during the training phase.…
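The SAVLD idea in the abstract can be illustrated with a minimal sketch. It assumes each label already has an embedding vector; the paper obtains these from a language model such as BERT, whereas the tiny hand-made vectors below are purely illustrative. The function names (`build_savld`, `audio_is_relevant`), the label names, and the top-N overlap rule used as a relevance check are all hypothetical. The abstract does not specify how relevance is computed, and this fixed rule is not the paper's learnable dropout mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_savld(video_label_embs, audio_label_embs, k):
    """Map each video label to its k most similar audio labels."""
    savld = {}
    for v_label, v_emb in video_label_embs.items():
        ranked = sorted(
            audio_label_embs,
            key=lambda a_label: cosine(v_emb, audio_label_embs[a_label]),
            reverse=True,
        )
        savld[v_label] = ranked[:k]
    return savld


def audio_is_relevant(savld, video_label, audio_scores, top_n):
    """Hypothetical rule (an assumption, not the paper's method):
    the audio counts as relevant if any of the top_n audio labels
    predicted by the audio model appears in the SAVLD entry."""
    predicted = sorted(audio_scores, key=audio_scores.get, reverse=True)[:top_n]
    return any(label in savld[video_label] for label in predicted)


# Toy embeddings standing in for language-model label embeddings.
video_label_embs = {
    "playing guitar": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.2, 0.9],
}
audio_label_embs = {
    "guitar": [1.0, 0.0, 0.0],
    "music": [0.8, 0.3, 0.0],
    "water splash": [0.0, 0.1, 1.0],
    "speech": [0.1, 1.0, 0.1],
}

savld = build_savld(video_label_embs, audio_label_embs, k=2)

# Made-up scores for a swimming clip whose soundtrack is background music.
music_over_swimming = {"guitar": 0.1, "music": 0.8, "water splash": 0.05, "speech": 0.3}
relevant = audio_is_relevant(savld, "swimming", music_over_swimming, top_n=1)
```

In this toy setup the swimming clip's dominant audio label is not among the audio labels mapped to "swimming", so the check flags the audio as irrelevant. That is the kind of training signal the abstract describes SAVLD and the pretrained audio model as providing.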
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Anomaly Detection Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Label Smoothing · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer
