Multimodal Question Answering for Unified Information Extraction
Yuxuan Sun, Kai Zhang, Yu Su

TL;DR
This paper introduces a unified multimodal question answering framework that enhances information extraction from multimedia content, significantly improving performance across various models and settings, including zero-shot and few-shot scenarios.
Contribution
The paper proposes a novel MQA framework that unifies multiple MIE tasks into a single pipeline, improving model generalization and performance with large multimodal models.
Findings
Consistent performance improvements on six datasets
Outperforms state-of-the-art in zero-shot settings
Enhances smaller models to compete with larger ones like GPT-4
Abstract
Multimodal information extraction (MIE) aims to extract structured information from unstructured multimedia content. Due to the diversity of tasks and settings, most current MIE models are task-specific and data-intensive, which limits their generalization to real-world scenarios with diverse task requirements and limited labeled data. To address these issues, we propose a novel multimodal question answering (MQA) framework to unify three MIE tasks by reformulating them into a unified span extraction and multi-choice QA pipeline. Extensive experiments on six datasets show that: 1) Our MQA framework consistently and significantly improves the performance of various off-the-shelf large multimodal models (LMMs) on MIE tasks, compared to vanilla prompting. 2) In the zero-shot setting, MQA outperforms previous state-of-the-art baselines by a large margin. In addition, the effectiveness of…
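To make the "span extraction and multi-choice QA" reformulation described in the abstract concrete, here is a minimal sketch of how a text input might be turned into the two kinds of query. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the names `QAQuery`, `build_span_query`, `build_choice_query`, and `parse_choice`, and the label set in the usage note are all hypothetical. No model is called; in practice the paired image would be passed to the LMM alongside the prompt.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


@dataclass
class QAQuery:
    """A text prompt for one stage of a QA-style pipeline (hypothetical container)."""
    stage: str                      # "span_extraction" or "multi_choice"
    prompt: str                     # text sent to the model together with the image
    options: List[str] = field(default_factory=list)  # labels, multi-choice only


def build_span_query(text: str, target: str) -> QAQuery:
    # Stage 1 (assumed wording): ask the model to copy candidate spans verbatim.
    prompt = (
        f"Text: {text}\n"
        f"Question: Which spans in the text are {target}? "
        "Copy each span exactly, separated by ';'."
    )
    return QAQuery(stage="span_extraction", prompt=prompt)


def build_choice_query(text: str, question: str, labels: List[str]) -> QAQuery:
    # Stage 2 (assumed wording): classification posed as a lettered multiple choice.
    if len(labels) > len(LETTERS):
        raise ValueError("too many labels for single-letter options")
    option_lines = [f"({LETTERS[i]}) {label}" for i, label in enumerate(labels)]
    prompt = "\n".join(
        [f"Text: {text}", f"Question: {question}", "Options:", *option_lines,
         "Answer with one option letter."]
    )
    return QAQuery(stage="multi_choice", prompt=prompt, options=list(labels))


def parse_choice(answer: str, query: QAQuery) -> Optional[str]:
    # Map a reply such as "(C)" or "C" back to its label; None if no valid letter.
    match = re.search(r"\(?\b([A-Z])\b\)?", answer.strip())
    if match is None:
        return None
    index = LETTERS.index(match.group(1))
    return query.options[index] if index < len(query.options) else None
```

For example, `build_choice_query("Messi joined Inter Miami.", "What is the entity type of 'Messi'?", ["person", "location", "organization", "other"])` yields a prompt listing options (A) through (D), and `parse_choice("(A)", query)` maps the reply back to `"person"`. Posing every task in one of these two shapes is what lets a single prompting pipeline cover several extraction tasks.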
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The framework demonstrates impressive generalization capabilities and stability, outperforming traditional models and LMMs, including ChatGPT and GPT-4. 2. MQA effectively enhances the performance of LMMs in MIE tasks. 3. The method's straightforward nature, requiring only input reformatting for specific subtasks, is commendable.
1. The addition of an extra span extraction step for certain tasks like MNER adds complexity. Also, the framework focuses on only three MIE tasks, not covering the full range of multimodal tasks. 2. The QA-based reformulation method, though effective, is not a new concept in NLP. Moreover, the paper does not delve deeply into understanding the interactions between different modalities. 3. The technical novelty of MQA in comparison to existing QA-driven frameworks in NLP is not sufficiently highl
1. The proposed MQA achieves SOTA results on six datasets across three MIE subtasks, showcasing significant advancements. 2. The MQA framework exhibits impressive generalization and wide applicability. It effectively integrates with various LLMs, consistently outperforming their vanilla versions. Additionally, during robustness testing, MQA displays relatively small performance variation under different prompting strategies and input orders, underscoring its robustness and adaptability. 3. The
1. To unify MNER, MRE, and MED tasks, an additional span extraction is introduced for some tasks like MNER & MTED, which adds extra complexity to the overall system.
This paper makes a simple yet somewhat effective attempt to unify various MIE tasks. It reports relatively extensive experiments across different MIE tasks and LMM scales. The paper is well written and easy to follow.
The main weaknesses are three-fold: Overall, this paper lacks novelty and makes limited contributions. QA-based reformulation (both span-based and multiple-choice) is one of the most typical and long-standing formats for unifying various IE tasks in the NLP community, especially in the era of large-scale models [1-4]. Although this paper targets MIE and incorporates another modality, i.e. an input image paired with the text, it does not pay more attention to understanding the image content and i
1. The paper proposes to unify three Multimodal Information Extraction (MIE) tasks (i.e., multimodal named entity extraction, multimodal relation extraction, and multimodal event detection) into a single pipeline, offering a more efficient and generalizable solution. 2. The MQA framework outperforms baseline models in terms of efficacy and generalization across multiple datasets.
1. The running title might be problematic or misleading. There seems to be a disconnect between the title and the content of the paper. While the title underscores the theme of unified information extraction, the body of the paper leans more towards the unification of different modalities in information extraction. Furthermore, the current approach brings together just three MIE tasks, leaving out others like multimodal aspect term extraction [1] and multimodal opinion extraction [2], which are
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Dense Connections · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization
