Multimodal Question Answering for Unified Information Extraction

Yuxuan Sun; Kai Zhang; Yu Su

arXiv:2310.03017·cs.CL·October 5, 2023·2 cites

TL;DR

This paper introduces a unified multimodal question answering (MQA) framework that recasts three multimodal information extraction tasks as span extraction and multi-choice QA. Across six datasets, it improves off-the-shelf large multimodal models over vanilla prompting, including in zero-shot and few-shot scenarios.

Contribution

The paper proposes a novel MQA framework that unifies multiple MIE tasks into a single pipeline, improving model generalization and performance with large multimodal models.

Findings

01 · Consistent performance improvements on six datasets
02 · Outperforms state-of-the-art in zero-shot settings
03 · Enhances smaller models to compete with larger ones like GPT-4

Abstract

Multimodal information extraction (MIE) aims to extract structured information from unstructured multimedia content. Due to the diversity of tasks and settings, most current MIE models are task-specific and data-intensive, which limits their generalization to real-world scenarios with diverse task requirements and limited labeled data. To address these issues, we propose a novel multimodal question answering (MQA) framework to unify three MIE tasks by reformulating them into a unified span extraction and multi-choice QA pipeline. Extensive experiments on six datasets show that: 1) Our MQA framework consistently and significantly improves the performances of various off-the-shelf large multimodal models (LMM) on MIE tasks, compared to vanilla prompting. 2) In the zero-shot setting, MQA outperforms previous state-of-the-art baselines by a large margin. In addition, the effectiveness of…
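To make the multi-choice reformulation described in the abstract concrete, here is a minimal sketch of how an entity-typing instance could be turned into a multi-choice question and how a model's reply could be mapped back to a label. The prompt wording, the `<image>` placeholder, the label set, the example sentence, and the function names `build_multichoice_prompt` and `parse_choice` are all illustrative assumptions; they are not taken from the paper or its repository.

```python
from string import ascii_uppercase


def build_multichoice_prompt(text, span, labels, image_hint="<image>"):
    """Format one entity-typing instance as a multi-choice question.

    The template is hypothetical; image_hint stands in for however a
    given multimodal model expects the image to be referenced.
    """
    options = [f"({ascii_uppercase[i]}) {label}" for i, label in enumerate(labels)]
    lines = [
        image_hint,
        f"Text: {text}",
        f'Question: Which type best describes "{span}" in the text?',
        "Options: " + " ".join(options),
        "Answer:",
    ]
    return "\n".join(lines)


def parse_choice(answer, labels):
    """Map a reply such as '(B)' or 'b' back to a label, or None if unparseable."""
    cleaned = answer.strip().lstrip("(").upper()
    if not cleaned:
        return None
    index = ascii_uppercase.find(cleaned[0])
    if 0 <= index < len(labels):
        return labels[index]
    return None


# Illustrative label set and sentence (assumptions, not from the paper).
labels = ["person", "location", "organization", "miscellaneous"]
prompt = build_multichoice_prompt(
    "Kevin Durant enters Oracle Arena", "Oracle Arena", labels
)
```

In a two-step pipeline of the kind the abstract outlines, the `span` argument would come from a preceding span extraction question; here it is supplied by hand to keep the sketch self-contained.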

Peer Reviews

Decision·ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5: marginally below the acceptance threshold · Confidence 3

Strengths

1. The framework demonstrates impressive generalization capabilities and stability, outperforming traditional models and LMMs, including ChatGPT and GPT-4.
2. MQA effectively enhances the performance of LMMs in MIE tasks.
3. The method's straightforward nature, requiring only input reformatting for specific subtasks, is commendable.

Weaknesses

1. The addition of an extra span extraction step for certain tasks like MNER adds complexity. Also, the framework focuses on only three MIE tasks, not covering the full range of multimodal tasks.
2. The QA-based reformulation method, though effective, is not a new concept in NLP. Moreover, the paper does not delve deeply into understanding the interactions between different modalities.
3. The technical novelty of MQA in comparison to existing QA-driven frameworks in NLP is not sufficiently highl

Reviewer 02 · Rating 8: accept, good paper · Confidence 4

Strengths

1. The proposed MQA achieves SOTA results on six datasets across three MIE subtasks, showcasing significant advancements.
2. The MQA framework exhibits impressive generalization and wide applicability. It effectively integrates with various LLMs, consistently outperforming their vanilla versions. Additionally, during robustness testing, MQA displays relatively small performance variation under different prompting strategies and input orders, underscoring its robustness and adaptability.
3. The

Weaknesses

1. To unify MNER, MRE, and MED tasks, an additional span extraction is introduced for some tasks like MNER & MTED, which adds extra complexity to the overall system.

Reviewer 03 · Rating 3: reject, not good enough · Confidence 5

Strengths

This paper makes a simple yet somehow effective attempt to unify various MIE tasks. This paper has performed relatively extensive experiments under the setting of different MIE tasks and LMM scales. This paper is well-written and quite easy to follow.

Weaknesses

The main weaknesses are three-fold: Overall, this paper lacks novelty and makes limited contributions. QA-based reformulation (both span-based and multiple-choice) is one of the most typical and long-standing formats for unifying various IE tasks in the NLP community, especially in the era of large-scale models [1-4]. Although this paper targets MIE and incorporates another modality, i.e. an input image paired with the text, it does not pay more attention to understanding the image content and i

Reviewer 04 · Rating 3: reject, not good enough · Confidence 4

Strengths

1. The paper proposes to unify three Multimodal Information Extraction (MIE) tasks (e.g., multimodal named entity extraction, multimodal relation extraction, and multimodal event detection) into a single pipeline, offering a more efficient and generalizable solution.
2. The MQA framework outperforms baseline models in terms of efficacy and generalization across multiple datasets.

Weaknesses

1. The running title might be problematic or misleading. There seems to be a disconnect between the title and the content of the paper. While the title underscores the theme of unified information extraction, the body of the paper leans more towards the unification of different modalities in information extraction. Furthermore, the current approach brings together just three MIE tasks, leaving out others like multimodal aspect term extraction [1] and multimodal opinion extraction [2], which are

Code & Models

Repositories

osu-nlp-group/mqa · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Dense Connections · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization