Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

Saghir Alfasly; Jian Lu; Chen Xu; Yuru Zou

arXiv:2203.03014·cs.CV·March 29, 2022

Open Access

TL;DR

This paper introduces a framework for multimodal action recognition that leverages audio in datasets with vision-specific annotations, using semantic label mapping and a learnable dropout mechanism, and reports improved performance on Kinetics400 and UCF-101.

Contribution

It proposes a semantic audio-video label dictionary (SAVLD) and a learnable irrelevant modality dropout to make better use of audio in action recognition datasets whose annotations are vision-specific.

Findings

01. Outperforms existing methods on the Kinetics400 and UCF-101 datasets.

02. Effectively filters out the audio modality when it is irrelevant during training.

03. Enhances visual modality modeling with a new two-stream Transformer.

Abstract

Current multimodal methods apply modality fusion or cross-modality attention under the assumption that a video dataset is multimodality annotated, that is, that the auditory and visual modalities are both labeled or class-relevant. However, effectively leveraging the audio modality in vision-specific annotated videos for action recognition is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-specific annotated dataset. We adopt language models (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels, so that SAVLD serves as a bridge between audio and video datasets. SAVLD, along with a pretrained audio multi-label model, is then used to estimate the audio-visual modality relevance during the training phase.…
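The label-mapping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `toy_embed` stands in for a language-model embedding such as BERT, and the names `build_savld` and `estimate_relevance`, the `top_n` parameter, and the overlap-based relevance rule are all hypothetical choices made for illustration. The abstract does not specify how relevance is actually computed.

```python
import math


def toy_embed(label, vocab):
    """Bag-of-words vector; a stand-in for a BERT-style embedding (assumption)."""
    words = set(label.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_savld(video_labels, audio_labels, k):
    """Map each video label to its k most similar audio labels."""
    vocab = sorted({w for label in list(video_labels) + list(audio_labels)
                    for w in label.lower().split()})
    audio_vecs = [toy_embed(a, vocab) for a in audio_labels]
    savld = {}
    for video_label in video_labels:
        v_vec = toy_embed(video_label, vocab)
        # Rank audio labels by descending similarity; ties keep input order.
        ranked = sorted(range(len(audio_labels)),
                        key=lambda j: -cosine(v_vec, audio_vecs[j]))
        savld[video_label] = [audio_labels[j] for j in ranked[:k]]
    return savld


def estimate_relevance(savld, video_label, audio_scores, top_n=1):
    """Hypothetical rule: fraction of the top_n predicted audio labels
    that appear in the dictionary entry for the video label."""
    predicted = sorted(audio_scores, key=audio_scores.get, reverse=True)[:top_n]
    if not predicted:
        return 0.0
    relevant = set(savld[video_label])
    return sum(p in relevant for p in predicted) / len(predicted)


if __name__ == "__main__":
    savld = build_savld(["playing guitar", "chopping wood"],
                        ["guitar", "wood chopping", "speech"], k=1)
    print(savld)
    # Scores here are made-up outputs of an audio multi-label model.
    print(estimate_relevance(savld, "playing guitar",
                             {"guitar": 0.9, "speech": 0.2}))
```

In a fuller version, `toy_embed` would be replaced by sentence embeddings from a pretrained language model, and `audio_scores` would come from a pretrained audio multi-label classifier; a low relevance score would then signal that the audio stream of a clip is a candidate for being dropped.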

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Anomaly Detection Techniques and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Residual Connection · Label Smoothing · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer