DOMINO: A Dual-System for Multi-step Visual Language Reasoning
Peifeng Wang, Olga Golovneva, Armen Aghajanyan, Xiang Ren, Muhao Chen, Asli Celikyilmaz, Maryam Fazel-Zarandi

TL;DR
This paper introduces DOMINO, a dual-system approach for multi-step visual language reasoning that separates visual information extraction from deliberate reasoning, improving accuracy over existing methods.
Contribution
The work presents a novel dual-system framework that decomposes visual language reasoning into two steps, enhancing interpretability and performance with minimal training data.
Findings
Outperforms prior methods on chart and plot datasets.
Fine-tuning System-2 improves accuracy by over 5%.
Achieves state-of-the-art results with limited data.
Abstract
Visual language reasoning requires a system to extract text or numbers from information-dense images like charts or plots and perform logical or arithmetic reasoning to arrive at an answer. To tackle this task, existing work relies on either (1) an end-to-end vision-language model trained on a large amount of data, or (2) a two-stage pipeline where a captioning model converts the image into text that is further read by another large language model to deduce the answer. However, the former approach forces the model to answer a complex question with one single step, and the latter approach is prone to inaccurate or distracting information in the converted text that can confuse the language model. In this work, we propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning.…
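The division of labor described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the chart is a plain dictionary rather than an image, and `system1_extract`, `parse_values`, and `system2_reason` are hypothetical names invented for this example. It only shows the control flow, in which a reasoning step asks an extraction step for the specific values it needs and then does the arithmetic itself.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a chart image: series name -> {x label: value}.
Chart = Dict[str, Dict[str, float]]


def system1_extract(chart: Chart, request: str) -> str:
    """Toy 'System-1': return the values of one requested series as text.

    A real System-1 would be a vision module reading pixels; this stub
    (an assumption for illustration) only looks up a dictionary.
    """
    series = chart[request]  # raises KeyError for an unknown series
    return ", ".join(f"{label}: {value}" for label, value in series.items())


def parse_values(extracted: str) -> List[float]:
    """Parse the 'label: value' pairs produced by system1_extract."""
    return [float(pair.split(": ")[1]) for pair in extracted.split(", ")]


def system2_reason(series_names: List[str],
                   extract: Callable[[str], str]) -> float:
    """Toy 'System-2': request two series, then compare their totals.

    A real System-2 would be a language model deciding step by step what
    to ask for; here the plan is hard-coded: total of the first series
    minus total of the second.
    """
    totals = []
    for name in series_names:
        observation = extract(name)  # one System-1 call per step
        totals.append(sum(parse_values(observation)))
    return totals[0] - totals[1]


chart = {
    "2019": {"A": 10.0, "B": 5.0},
    "2020": {"A": 12.0, "B": 9.0},
}
answer = system2_reason(["2020", "2019"], lambda r: system1_extract(chart, r))
print(answer)  # difference between the 2020 total and the 2019 total
```

Because the reasoning step only sees the values it asked for, it is not exposed to unrelated text from a full chart-to-text conversion, which is the failure mode the abstract attributes to two-stage captioning pipelines.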
Peer Reviews
Decision·Submitted to ICLR 2024
1. The paper is clearly written. 2. The results are great, compared to few-shot baselines, and the performance gain is analyzed carefully. 3. The paper proposed a demonstration of two stage reasoning using LLMs for task decomposition using the feedback from perception results, which is novel compared to similar LLM-guided systems without feedback, e.g., [1]. The efficiency of fine-tuning of LLM also supports the decomposition of System-1/2. 4. The authors thoroughly discussed the functionality o
A few unclear points are raised in Questions.
1. This method is intuitive, and I am happy to see the introduction of dual-system into vision-language reasoning. 2. The proposed method achieves SOTA results on ChartQA. 3. Analysis shows that DOMINO is more robust in handling complex charts.
1. The authors did not discuss efficiency. How does the inference efficiency of DOMINO compare to that of the baseline methods? 2. The template seems relatively limited; more non-ChartQA tasks are needed to confirm the potential of this method.
• The paper is well written and easy to understand. Figure 1 provides a good overview of the complete system. • The paper presents promising results on ChartQA and outperforms prior supervised baselines. • The paper includes ablation studies in Figure 3.
• Novelty: The core idea of the paper is very similar to prior work, including “Visual Programming: Compositional Visual Reasoning Without Training, CVPR 2023” which also uses a large LLM for reasoning and perception modules to extract information from images. Additionally, “Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, arXiv 2022” also performs zero-shot multi-modal reasoning in a similar fashion. “Look, Remember and Reason: Visual Reasoning with Grounded Rationales,
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
