Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model

Zhiqi Huang; Fenglin Liu; Xian Wu; Shen Ge; Helin Wang; Wei Fan; Yuexian Zou

arXiv:2107.01571·cs.CL·July 6, 2021

Open Access

TL;DR

This paper introduces a multimodal machine comprehension model that fuses audio and textual data, enabling it to perform well on various MC tasks and outperform unimodal models through novel attention and knowledge distillation techniques.

Contribution

It proposes the DIIA model for effective audio-text fusion and the MKD module for unimodal prediction, advancing multimodal MC capabilities.

Findings

01. DIIA improves accuracy by up to 21.08%.

02. MKD enables the model to outperform unimodal models by up to 18.87%.

03. The model handles multiple MC tasks with a single architecture.

Abstract

While Machine Comprehension (MC) has attracted extensive research interest in recent years, existing approaches mainly fall under the Machine Reading Comprehension task, which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, many MC tasks accept audio input in addition to the textual input, e.g., English listening comprehension tests. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, whose goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge…
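To make the idea of inter- and intra-modality attention concrete, here is a minimal sketch in plain Python. It is not the paper's DIIA model: the abstract gives no architectural details, so this uses generic scaled dot-product attention, omits any "dynamic" weighting, and concatenates the two views as an illustrative choice. All function names (`softmax`, `attend`, `fuse`) and the toy feature vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention.

    Each query vector is replaced by a weighted average of `values`,
    with weights given by its similarity to each key.
    """
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse(audio, text):
    """Hypothetical fusion step (an assumption, not the paper's method)."""
    # Intra-modality: each modality attends to itself.
    audio_intra = attend(audio, audio, audio)
    text_intra = attend(text, text, text)
    # Inter-modality: each modality queries the other one.
    audio_inter = attend(audio, text, text)
    text_inter = attend(text, audio, audio)
    # Concatenate the intra and inter views position by position.
    fused_audio = [a + b for a, b in zip(audio_intra, audio_inter)]
    fused_text = [a + b for a, b in zip(text_intra, text_inter)]
    return fused_audio, fused_text


# Toy inputs: 3 audio frames and 2 text tokens, each a 2-dimensional feature.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, 0.0]]
fused_audio, fused_text = fuse(audio, text)
```

Each fused position keeps its original sequence length but doubles its feature width, because the self-attended view and the cross-attended view are placed side by side. A real model would typically add learned projections and a gating or weighting step on top of this.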

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Knowledge Distillation