A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
Mateusz Malinowski, Mario Fritz

TL;DR
This paper introduces a multi-world Bayesian approach to answer complex questions about real-world images, integrating NLP and computer vision to handle uncertainty and provide diverse answer types.
Contribution
It presents a novel multi-world Bayesian framework for visual question answering that manages uncertainty and handles complex, realistic scene questions.
Findings
Established a new benchmark for visual question answering.
Demonstrated the system's ability to handle complex, high-uncertainty questions.
Achieved diverse answer types including counts, object classes, and lists.
Abstract
We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions by a multi-world approach that represents uncertainty about the perceived world in a bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with range of answer like counts, object classes, instances and lists of them. The system is directly trained from question-answer pairs. We establish a first benchmark for this task that can be seen as a modern attempt at a visual turing test.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
