MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Jiashuo Yu; Ying Cheng; Rui-Wei Zhao; Rui Feng; Yuejie Zhang

arXiv:2111.12374 · cs.CV · July 13, 2022

TL;DR

The paper introduces MM-Pyramid, a multimodal pyramid attentional network that captures multi-scale semantic features for improved audio-visual event localization and video parsing.

Contribution

It proposes a novel attentive feature pyramid module and an adaptive semantic fusion module for multi-scale event localization in videos.

Findings

1. Effective in localizing events of different lengths
2. Outperforms previous methods on audio-visual event localization
3. Improves weakly-supervised video parsing accuracy

Abstract

Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for complete scene comprehension. Most previous works attempted to analyze videos from a holistic perspective. However, they do not consider semantic information at multiple scales, which makes it difficult for the model to localize events of different lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) for event localization. Specifically, we first propose the attentive feature pyramid module. This module captures temporal pyramid features via several stacked pyramid units, each of which is composed of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block…
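As a rough illustration of why dilated convolutions suit events of different lengths, the sketch below implements a generic 1D dilated convolution in plain Python and computes the receptive field of a stack of such layers. It is not the authors' implementation: the function names `dilated_conv1d` and `receptive_field`, the kernel size of 3, and the dilation rates 1, 2, 4 are assumptions chosen for illustration only, and the attention blocks mentioned in the abstract are not modeled.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded ("same" length) 1D dilated cross-correlation.

    Hypothetical helper for illustration; assumes an odd kernel length.
    """
    half = (len(kernel) - 1) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` steps apart around position t.
            idx = t + (j - half) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Temporal span (in segments) seen by one output of a stack of layers."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Assumed settings, not taken from the paper: kernel size 3, dilations 1, 2, 4.
impulse = [0, 0, 0, 1, 0, 0, 0]
spread = dilated_conv1d(impulse, [1, 1, 1], dilation=2)
span = receptive_field(3, [1, 2, 4])
```

With these assumed settings, each additional layer widens the temporal context without adding parameters per layer, so shallow units respond to short events while deeper units cover longer ones.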

Code & Models

Repositories

JustinYuu/MM_Pyramid (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization

Methods: Dilated Convolution · Convolution