Trying Bilinear Pooling in Video-QA
Thomas Winterbottom, Sarah Xiao, Alistair McLean, Noura Al Moubayed

TL;DR
This paper investigates the application of bilinear pooling (BLP) techniques to video question answering (video-QA), finding that simple integration often harms performance and providing insights into the challenges of using BLP in this domain.
Contribution
The study applies BLP methods to various video-QA benchmarks, revealing their limited effectiveness and offering best practices for future application in video-QA tasks.
Findings
BLP integration generally harms video-QA performance
Theoretical analysis explains challenges of BLP in video-QA
Recommendations for effective BLP application in video-QA
Abstract
Bilinear pooling (BLP) refers to a family of operations for fusing features from different modalities, developed recently and predominantly for visual question answering (VQA) models. A bilinear (outer-product) expansion is thought to encourage models to learn interactions between two feature spaces and has experimentally outperformed 'simpler' vector operations (concatenation and element-wise addition/multiplication) on VQA benchmarks. Successive BLP techniques have yielded higher performance with lower computational expense and are often implemented alongside attention mechanisms. However, despite significant progress in VQA, BLP methods have not been widely applied to the more recently explored video question answering (video-QA) tasks. In this paper, we begin to bridge this research gap by applying BLP techniques to various video-QA benchmarks, namely: TVQA, TGIF-QA, Ego-VQA and MSVD-QA. We share our results…
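To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of a full bilinear (outer-product) fusion next to the "simpler" operations it is compared against. It is an illustration only, not the authors' implementation: the function name `bilinear_pool`, the dimensions `m`, `n`, `k`, and the random inputs are all assumptions chosen for the example.

```python
import numpy as np

def bilinear_pool(x, y, W):
    """Full bilinear fusion (illustrative): z[k] = x^T W[k] y.

    x: (m,) feature from one modality (e.g. visual)
    y: (n,) feature from the other modality (e.g. text)
    W: (k, m, n) learned weights, one (m, n) matrix per output unit
    """
    outer = np.outer(x, y)  # (m, n): every pairwise product x[i] * y[j]
    # Contract the (m, n) axes of W against the outer product -> (k,)
    return np.tensordot(W, outer, axes=([1, 2], [0, 1]))

# Hypothetical sizes and random data, chosen only for the example
rng = np.random.default_rng(0)
m, n, k = 4, 3, 2
x = rng.standard_normal(m)
y = rng.standard_normal(n)
W = rng.standard_normal((k, m, n))

z_bilinear = bilinear_pool(x, y, W)   # shape (k,)

# The simpler fusion operations mentioned in the abstract
z_concat = np.concatenate([x, y])     # shape (m + n,)
y_same = rng.standard_normal(m)       # element-wise ops need equal sizes
z_add = x + y_same                    # shape (m,)
z_mul = x * y_same                    # shape (m,)

# Full bilinear weights grow as k * m * n
n_params = W.size
```

In this sketch the weight tensor has `k * m * n` entries, which becomes very large for realistic feature sizes. This is one way to read the abstract's remark that successive BLP techniques aim for lower computational expense.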
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
