QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models
Kuei-Chun Kao, Hsu Tzu-Yin, Yunqi Hong, Ruochen Wang, Cho-Jui Hsieh

TL;DR
This paper identifies limitations in current multimodal large language models for multi-image reasoning and introduces QG-CoC, a new zero-shot prompting method that improves perception and reasoning across multiple images.
Contribution
The paper proposes QG-CoC, a novel zero-shot prompting approach that enhances multi-image reasoning in large multimodal models, addressing gaps in perception and integration.
Findings
QG-CoC outperforms existing prompting methods in multi-image tasks.
It shows robust improvements in challenging multi-image reasoning scenarios.
The method is effective across various open-source and closed-source models.
Abstract
Recently, Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. However, while various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
