Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
Zhilin Zhang, Fangyu Wu

TL;DR
This paper compares complex and simple textual encoders in VQA, revealing that simpler models can be more effective and introducing ConvGRU, a lightweight convolutional model that improves question understanding with less complexity.
Contribution
The paper provides a comprehensive comparison of textual encoders in VQA and proposes ConvGRU, which adds convolutional layers to improve text feature representation without substantially increasing model complexity.
Findings
Complex encoders are not always optimal for VQA-v2.
ConvGRU improves performance on Number and Count questions.
Lightweight models can be effective in resource-constrained settings.
Abstract
Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, with increasing research focused on enhancing VQA accuracy through advanced models such as Transformers. Despite this growing interest, limited work has examined the comparative effectiveness of textual encoders in VQA, particularly considering model complexity and computational efficiency. In this work, we conduct a comprehensive comparison between complex textual models that leverage long-range dependencies and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not always the optimal approach for the VQA-v2 dataset. Motivated by this insight, we propose ConvGRU, a model that incorporates convolutional layers to improve text feature representation without substantially increasing model complexity.…
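To make the idea of convolutional layers feeding a recurrent question encoder concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the truncated abstract does not give the ConvGRU architecture, so the single convolution layer, the ReLU, the bias-free GRU cell, and every name and dimension below (conv1d_same, gru_step, encode_question) are assumptions chosen for illustration only.

```python
import math


def conv1d_same(seq, kernels, bias):
    """Zero-padded 1D convolution over time, followed by ReLU.

    seq:     list of T token vectors, each of size d_in
    kernels: list of d_out filters; each filter is k vectors of size d_in (k odd)
    bias:    list of d_out floats
    Returns T vectors of size d_out, so sequence length is preserved.
    """
    k = len(kernels[0])
    pad = k // 2
    zero = [0.0] * len(seq[0])
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]  # local n-gram context around token t
        feats = []
        for filt, b in zip(kernels, bias):
            s = b + sum(w * x
                        for wv, xv in zip(filt, window)
                        for w, x in zip(wv, xv))
            feats.append(max(0.0, s))  # ReLU
        out.append(feats)
    return out


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h, p):
    """One step of a standard GRU cell (biases omitted for brevity).

    p holds hypothetical weight matrices: Wz, Wr, Wn act on the input x,
    Uz, Ur, Un act on the hidden state h.
    """
    z = [_sigmoid(a + b) for a, b in zip(_matvec(p["Wz"], x), _matvec(p["Uz"], h))]
    r = [_sigmoid(a + b) for a, b in zip(_matvec(p["Wr"], x), _matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(a + b) for a, b in zip(_matvec(p["Wn"], x), _matvec(p["Un"], rh))]
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def encode_question(embeddings, kernels, bias, gru_params, hidden_size):
    """Convolve word embeddings for local features, then summarize with a GRU."""
    h = [0.0] * hidden_size
    for x in conv1d_same(embeddings, kernels, bias):
        h = gru_step(x, h, gru_params)
    return h  # final hidden state used as the question feature
```

In this sketch the convolution adds only k * d_in * d_out + d_out parameters in front of the GRU, which is one way a model can capture local textual features at a small cost in size; whether this matches the paper's design is not confirmed by the text above.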
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
