Diversify, Rationalize, and Combine: Ensembling Multiple QA Strategies for Zero-shot Knowledge-based VQA
Miaoyu Li, Haoxin Li, Zilin Du, and Boyang Li

TL;DR
This paper introduces DietCoke, a multi-strategy ensemble framework for knowledge-based visual question answering that combines diverse reasoning tactics and rationales to improve accuracy over existing methods.
Contribution
It proposes a three-stage ensemble approach that diversifies question-answering strategies, generates rationales for each candidate answer, and uses those rationales to combine the answers, advancing zero-shot K-VQA performance.
Findings
Outperforms state-of-the-art baselines by 2.8% on OK-VQA and 4.7% on A-OKVQA.
Demonstrates high complementarity among ensemble strategies.
Validates effectiveness of rationales in answer selection.
Abstract
Knowledge-based Visual Question-answering (K-VQA) often requires the use of background knowledge beyond the image. However, we discover that a single knowledge generation strategy is often insufficient for all K-VQA questions. To this end, we propose Diversification, Evidence Truncation, and Combination for Knowledge-based Elucidation (DietCoke), which utilizes a bundle of complementary question-answering tactics and aggregates their answers using textual rationales. DietCoke comprises three stages: diversification, rationalization, and ensemble. The diversification stage generates three distinctive decision contexts, each leading to its own answer candidate. The rationalization stage generates two rationales, the automatic rationale and the mechanistic rationale, for each answer candidate using decorrelated techniques. Finally, in the ensemble stage, an LLM informed by the…
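The three stages named in the abstract can be read as a simple control flow. The sketch below is only an illustration of that flow, not the authors' implementation: the names `Candidate` and `answer_with_ensemble`, and the callables `strategies`, `rationalizers`, and `selector`, are hypothetical placeholders for whatever models produce the answers, the rationales, and the final choice. Their signatures are assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Candidate:
    """One answer candidate plus the rationales generated for it (hypothetical type)."""
    strategy: str
    answer: str
    rationales: List[str] = field(default_factory=list)


def answer_with_ensemble(
    image: Any,
    question: str,
    strategies: Dict[str, Callable[[Any, str], str]],
    rationalizers: List[Callable[[str, str], str]],
    selector: Callable[[str, List[Candidate]], str],
) -> str:
    """Assumed control flow: diversify, rationalize, then combine."""
    # Stage 1 (diversification): each strategy yields its own answer candidate.
    candidates = [
        Candidate(strategy=name, answer=answer_fn(image, question))
        for name, answer_fn in strategies.items()
    ]

    # Stage 2 (rationalization): every rationale generator is applied to
    # every candidate, so each candidate carries one rationale per generator.
    for candidate in candidates:
        candidate.rationales = [
            make_rationale(question, candidate.answer)
            for make_rationale in rationalizers
        ]

    # Stage 3 (ensemble): a selector (an LLM in the paper; any callable here)
    # sees the candidates with their rationales and returns the final answer.
    return selector(question, candidates)
```

With three entries in `strategies` and two in `rationalizers`, the selector receives three candidates carrying two rationales each, which matches the counts given in the abstract. How each rationale is actually produced, and what the selector is shown, is not specified in the text available here.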
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
