Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models
Amir Mohammad Karimi Mamaghan, Samuele Papa, Karl Henrik Johansson, Stefan Bauer, Andrea Dittadi

TL;DR
This paper empirically compares object-centric representations and foundation models in visual question answering, revealing their respective strengths and potential for combined use in complex scene understanding.
Contribution
The paper provides the first extensive empirical analysis of object-centric (OC) models versus foundation models in VQA, exploring their benefits, trade-offs, and integration potential.
Findings
OC models excel in compositional reasoning tasks.
Foundation models demonstrate strong generalization capabilities.
Combining both paradigms offers promising improvements.
Abstract
Object-centric (OC) representations, which model visual scenes as compositions of discrete objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have yet to be thoroughly validated empirically. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains, from language to computer vision, positioning them as a potential cornerstone of future research for a wide range of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic…
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Visual and Cognitive Learning Processes
