MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images
Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao, and Kang Liu

TL;DR
This paper introduces MMHQA-ICL, a novel framework utilizing in-context learning with LLMs for hybrid question answering over text, tables, and images, achieving state-of-the-art results in few-shot settings.
Contribution
It presents the first end-to-end LLM prompting method for multimodal hybrid QA, incorporating a heterogeneous data retriever, image captioning, and type-specific in-context learning strategies.
Findings
Outperforms all baselines on the MultimodalQA dataset
Achieves state-of-the-art results in few-shot learning
Demonstrates effectiveness of end-to-end LLM prompting for multimodal QA
Abstract
In the real world, knowledge often exists in a multimodal and heterogeneous form. Question answering over hybrid data types, including text, tables, and images (MMHQA), is a challenging task. Recently, with the rise of large language models (LLMs), in-context learning (ICL) has become the most popular way to solve QA problems. We propose the MMHQA-ICL framework to address this problem, which includes a stronger heterogeneous data retriever and an image caption module. Most importantly, we propose a Type-specific In-context Learning Strategy for MMHQA, enabling LLMs to leverage their powerful performance in this task. We are the first to use an end-to-end LLM prompting method for this task. Experimental results demonstrate that our framework outperforms all baselines and methods trained on the full dataset, achieving state-of-the-art results under the few-shot setting…
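The abstract names three components (a heterogeneous data retriever, an image caption module, and type-specific in-context learning) without giving their mechanics. The following is only a minimal sketch of how such pieces might fit together, not the authors' implementation. The names `Evidence`, `retrieve`, and `build_prompt`, the word-overlap ranking, and the idea of keying demonstrations by question type in a dictionary are all assumptions made for illustration. Images are assumed to have already been converted to captions, so every piece of evidence is plain text.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Evidence:
    modality: str  # "text", "table", or "image"
    content: str   # passage, linearized table, or image caption (assumed precomputed)


def _words(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, pool: list[Evidence], k: int = 2) -> list[Evidence]:
    """Toy stand-in for a retriever: rank evidence by word overlap with the question."""
    q_words = _words(question)
    return sorted(pool, key=lambda ev: len(q_words & _words(ev.content)), reverse=True)[:k]


def build_prompt(question: str, qtype: str, evidence: list[Evidence],
                 demos: dict[str, list[str]]) -> str:
    """Assemble a prompt whose few-shot demonstrations depend on the question type."""
    parts = list(demos.get(qtype, []))  # hypothetical type-specific demonstrations
    parts += [f"[{ev.modality}] {ev.content}" for ev in evidence]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Illustrative data only; none of it comes from the paper or its dataset.
pool = [
    Evidence("text", "Paris hosts a famous iron tower."),
    Evidence("table", "city | country\nParis | France"),
    Evidence("image", "caption: a red bridge over a bay"),
]
demos = {
    "image": ["[image] caption: a white dog on grass\n\nQ: What color is the dog?\nA: white"],
    "table": ["[table] name | age\nAnn | 30\n\nQ: How old is Ann?\nA: 30"],
}
question = "What color is the bridge?"
prompt = build_prompt(question, "image", retrieve(question, pool, k=1), demos)
```

The resulting `prompt` string would then be sent to an LLM. How the question type is determined, and which model is called, are left out here because the abstract does not specify them.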
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
