Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates
Feiyan Hu, Eva Mohedano, Noel O'Connor, Kevin McGuinness

TL;DR
This paper introduces Temporal Bilinear Encoding Networks (TBEN) for classifying videos sampled at a low rate (1 frame per second). TBEN uses bilinear pooling to capture long-range audio-visual temporal information at reduced computational cost and achieves state-of-the-art accuracy on FGA240 (hit@1=47.95%).
Contribution
The paper proposes TBEN, which uses bilinear pooling to encode long-range audio and visual temporal features in videos sampled at a low rate, and embeds the label hierarchy in the network to improve the robustness of the classifier.
Findings
Achieves state-of-the-art accuracy on the FGA240 dataset (hit@1=47.95%)
Comes close to state-of-the-art accuracy on UCF101 with significantly less computation
Bilinear pooling outperforms average pooling over the temporal dimension for low-sampling-rate videos
Abstract
Current deep-learning-based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate. We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long-range temporal information using bilinear pooling and demonstrate that bilinear pooling is better than average pooling on the temporal dimension for videos with a low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments on the FGA240 fine-grained classification dataset using TBEN achieve a new state-of-the-art (hit@1=47.95%). We also exploit the possibility of incorporating TBEN with multiple decoupled modalities like visual semantic and motion features:…
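To illustrate why bilinear pooling over time can retain information that average pooling discards, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy features, and the signed square root plus L2 normalisation step (a common post-processing choice for bilinear features) are all assumptions made for illustration. Each row of the input stands for one per-frame feature vector, for example from a clip sampled at 1 frame per second.

```python
import numpy as np


def average_pool(features):
    """Average pooling over time: (T, D) -> (D,). Keeps only first-order statistics."""
    return features.mean(axis=0)


def bilinear_pool(features, eps=1e-12):
    """Generic temporal bilinear pooling: (T, D) -> (D * D,).

    Averages the outer product of each frame's feature vector with itself,
    which captures pairwise (second-order) feature interactions.
    The signed square root and L2 normalisation are assumed post-processing,
    not something stated in the summary above.
    """
    num_frames = features.shape[0]
    second_order = features.T @ features / num_frames  # (D, D) mean outer product
    vec = second_order.reshape(-1)                      # flatten to a vector
    vec = np.sign(vec) * np.sqrt(np.abs(vec))           # signed square root
    return vec / (np.linalg.norm(vec) + eps)            # L2 normalisation


# Two hypothetical 2-frame clips whose per-frame features have the same mean.
clip_a = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
clip_b = np.array([[0.5, 0.5],
                   [0.5, 0.5]])

avg_a, avg_b = average_pool(clip_a), average_pool(clip_b)
bil_a, bil_b = bilinear_pool(clip_a), bilinear_pool(clip_b)

print("average pooling identical:", np.allclose(avg_a, avg_b))
print("bilinear pooling identical:", np.allclose(bil_a, bil_b))
```

In this toy case the two clips are indistinguishable after average pooling but produce different bilinear descriptors, at the cost of a descriptor whose size grows quadratically with the feature dimension.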
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing
Methods: Average Pooling
