Audio-Visual Glance Network for Efficient Video Recognition

Muhammad Adi Nugroho; Sangmin Woo; Sumin Lee; Changick Kim

arXiv:2308.09322·cs.CV·August 21, 2023

Open Access

TL;DR

The paper introduces AVGN, an efficient audio-visual network that selectively processes important video segments and patches, significantly reducing computation while maintaining state-of-the-art recognition accuracy.

Contribution

It proposes a novel multi-modal approach combining saliency estimation and patch attention to improve video recognition efficiency and accuracy.

Findings

1. Achieves state-of-the-art results on multiple benchmarks.

2. Reduces computational cost compared to traditional methods.

3. Demonstrates faster processing speeds without sacrificing accuracy.

Abstract

Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN first divides the video into snippets of image-audio clip pairs and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates a saliency score for each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an…
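
The two selection steps the abstract describes (keep the most salient frames, then process only a patch of each) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the saliency scores are taken as given rather than produced by AV-TeST, the patch center is a hypothetical input because the abstract is truncated before it says how patch locations are chosen, and the names `select_salient_frames` and `crop_patch` are invented for illustration.

```python
from typing import List, Sequence, Tuple


def select_salient_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    Hypothetical helper: `scores` stands in for per-frame saliency scores,
    which the paper obtains from AV-TeST; here they are simply an input.
    """
    if k <= 0:
        return []
    # Rank frame indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k and restore their original temporal order.
    return sorted(ranked[:k])


def crop_patch(
    frame: Sequence[Sequence[float]], center: Tuple[int, int], size: int
) -> List[List[float]]:
    """Crop a size x size patch around `center`, clamped to the frame bounds.

    Hypothetical helper: `center` is assumed to be supplied by some
    patch-location predictor that is not modeled here.
    """
    height, width = len(frame), len(frame[0])
    size = min(size, height, width)
    # Shift the window so that it never extends past the frame borders.
    top = min(max(center[0] - size // 2, 0), height - size)
    left = min(max(center[1] - size // 2, 0), width - size)
    return [list(row[left:left + size]) for row in frame[top:top + size]]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.2]
    keep = select_salient_frames(scores, k=2)
    print("frames kept:", keep)

    # A toy 4x4 "frame" whose pixel value encodes its position.
    frame = [[r * 4 + c for c in range(4)] for r in range(4)]
    print("patch:", crop_patch(frame, center=(3, 3), size=2))
```

In this sketch the computational saving comes from running the expensive classifier on only `k` cropped patches instead of every full frame; how `k` and the patch size are chosen is left open here.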

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection