Audio-Visual Glance Network for Efficient Video Recognition
Muhammad Adi Nugroho, Sangmin Woo, Sumin Lee, Changick Kim

TL;DR
The paper introduces AVGN, an efficient audio-visual network that selectively processes important video segments and patches, significantly reducing computation while maintaining state-of-the-art recognition accuracy.
Contribution
It proposes a novel multi-modal approach combining saliency estimation and patch attention to improve video recognition efficiency and accuracy.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Reduces computational cost compared to traditional methods.
Demonstrates faster processing speeds without sacrificing accuracy.
Abstract
Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN first divides the video into snippets of image-audio clip pairs and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates a saliency score for each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an…
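The temporal selection idea in the abstract (score every frame, then spend computation only on the salient ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_salient_frames`, the top-k selection rule, and the example scores are all assumptions made for illustration, since the abstract does not say how the saliency scores are turned into a selection.

```python
from typing import List, Sequence


def select_salient_frames(saliency: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    Hypothetical helper: top-k selection is an assumed rule, not taken
    from the paper.
    """
    if k <= 0:
        return []
    k = min(k, len(saliency))
    # Rank frame indices by descending saliency; ties favor the earlier frame.
    ranked = sorted(range(len(saliency)), key=lambda i: (-saliency[i], i))
    # Restore temporal order so a downstream classifier sees frames in sequence.
    return sorted(ranked[:k])


# Made-up per-frame saliency scores standing in for a saliency model's output.
scores = [0.10, 0.80, 0.30, 0.95, 0.05, 0.60]
selected = select_salient_frames(scores, k=3)
print(selected)
```

Only the frames at the returned indices would then be passed to the more expensive recognition stage, which is where the computational saving would come from under this assumed rule.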
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection
