Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh, Mrigank Raman, Mohammed Asad Karim, Pranit Chawla

TL;DR
This paper evaluates bridge-architectures, which project image embeddings into text space, on complex visual reasoning tasks like NLVR2, finding that multi-modal pre-training is crucial for performance and that adding object features does not improve results.
Contribution
The study extends bridge-architectures with object features for NLVR2 and compares their performance to transformer-based models, highlighting the importance of multi-modal pre-training.
Findings
Adding object features does not improve performance.
Multi-modal pre-training is essential for complex reasoning.
LLaVA shows promising zero-shot results.
Abstract
Recently there has been a surge of multi-modal architectures based on Large Language Models (LLMs). These models leverage the zero-shot generation capabilities of LLMs by projecting image embeddings into the text space and then using the LLM's auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We call these architectures "bridge-architectures" because they project from the image space to the text space. They deviate from the traditional recipe for training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co- or cross-attention. However, the capabilities of bridge-architectures have not been tested on complex visual reasoning tasks that require fine-grained analysis of the image. In this project, we investigate the performance of these bridge-architectures on the NLVR2 dataset, and…
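The projection step the abstract describes can be illustrated with a small sketch. It is not the paper's implementation: it assumes the simplest possible bridge, a single linear map from the image-embedding dimension to the text-embedding dimension, with the projected vectors prepended to the text token embeddings before they reach the LLM. The function names (`make_projection`, `project`, `build_llm_input`) and the toy dimensions are hypothetical, and the models evaluated in the paper may use different mappers.

```python
import random


def make_projection(d_img, d_txt, seed=0):
    """Return a randomly initialised (d_txt x d_img) weight matrix.

    Assumption: the bridge is one linear layer; in practice it would be learned.
    """
    rng = random.Random(seed)
    scale = 1.0 / d_img ** 0.5
    return [[rng.uniform(-scale, scale) for _ in range(d_img)]
            for _ in range(d_txt)]


def project(image_embeddings, weight):
    """Map each image vector (length d_img) to a vector of length d_txt."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in image_embeddings]


def build_llm_input(image_embeddings, text_embeddings, weight):
    """Prepend the projected image vectors to the text token embeddings."""
    visual_prefix = project(image_embeddings, weight)
    return visual_prefix + text_embeddings


# Toy example: 3 image patch embeddings (dim 4) and 5 text tokens (dim 6).
d_img, d_txt = 4, 6
image_embeddings = [[0.1 * i + 0.01 * j for j in range(d_img)] for i in range(3)]
text_embeddings = [[0.0] * d_txt for _ in range(5)]
weight = make_projection(d_img, d_txt)

sequence = build_llm_input(image_embeddings, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # prints: 8 6
```

In this sketch the LLM would see eight input positions of equal width: three "visual tokens" followed by five text tokens. All interaction between image and text then happens inside the language model's own attention layers, rather than through a dedicated co- or cross-attention module.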
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques
