Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems
Wang Zhu, Jesse Thomason, Robin Jia

TL;DR
This paper compares end-to-end and neuro-symbolic vision-language reasoning systems on four types of out-of-distribution tests (segment-combine, contrast set, compositional generalization, and cross-benchmark transfer), showing where each paradigm is strong or weak and arguing for diverse robustness evaluations.
Contribution
The paper analyzes how the two paradigms perform under four generalization scenarios on single-image and multi-image visual question answering, and highlights their complementary benefits.
Findings
End-to-end systems show sizeable performance drops on all four generalization tests.
Neuro-symbolic methods suffer larger drops than end-to-end systems on cross-benchmark transfer from GQA to VQA, but smaller accuracy drops on the other generalization tests.
Few-shot training quickly improves neuro-symbolic methods' performance.
Abstract
For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question-answering through four types of generalization tests: a novel segment-combine test for multi-image queries, contrast set, compositional generalization, and cross-benchmark transfer. Vision-and-language end-to-end trained systems exhibit sizeable performance drops across all these tests. Neuro-symbolic methods suffer even more on cross-benchmark transfer from GQA to VQA, but they show smaller accuracy drops on the other generalization tests and their performance quickly improves by few-shot training. Overall, our results demonstrate the complementary benefits of these two…
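The comparison described in the abstract rests on measuring how far accuracy falls between in-distribution and out-of-distribution test sets. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the function names `accuracy` and `accuracy_drops`, the exact-match scoring rule, and the toy predictions are all assumptions made for illustration, and none of the numbers come from the paper.

```python
from typing import Dict, List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Exact-match accuracy (an assumed scoring rule, not necessarily the paper's)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_drops(in_dist_acc: float, ood_acc: Dict[str, float]) -> Dict[str, float]:
    """Absolute accuracy drop from the in-distribution set to each OOD test."""
    return {name: in_dist_acc - acc for name, acc in ood_acc.items()}


# Hypothetical toy data, invented purely to exercise the functions.
gold = ["yes", "no", "red", "2"]
in_dist_predictions = ["yes", "no", "red", "2"]
ood_predictions = {
    "contrast_set": ["yes", "yes", "red", "2"],
    "compositional": ["no", "yes", "red", "2"],
}

in_dist = accuracy(in_dist_predictions, gold)
ood = {name: accuracy(preds, gold) for name, preds in ood_predictions.items()}
drops = accuracy_drops(in_dist, ood)
```

In this framing, a smaller drop on a given test indicates better generalization on that test, which is the sense in which the two paradigms are compared.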
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Test
