ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention

Jose Manuel Gomez-Perez; Raul Ortega

arXiv:2010.00562·cs.CL·October 2, 2020


TL;DR

This paper introduces ISAAQ, a system that combines pre-trained transformers with bottom-up and top-down attention mechanisms to improve multimodal reasoning in textbook question answering, achieving state-of-the-art results.

Contribution

By the authors' account, this is the first work to combine pre-trained transformer language models with bottom-up and top-down attention for multimodal textbook question answering (TQA).

Findings

01  Achieved 81.36% accuracy on true/false questions.

02  Attained 71.11% accuracy on text-only questions.

03  Reached 55.12% accuracy on diagram multiple-choice questions.

Abstract

Textbook Question Answering is a complex task at the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps into the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails. Rather than training a language-visual transformer from scratch, we rely on pre-trained transformers, fine-tuning and ensembling. We add bottom-up and top-down attention to identify regions of interest corresponding to diagram constituents and their relationships, improving the selection of relevant visual information for each question and its answer options. Our system ISAAQ reports unprecedented success in all TQA question types, with accuracies of 81.36%, 71.11% and 55.12% on true/false, text-only…
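
The abstract mentions ensembling several fine-tuned models but does not say how their outputs are combined. As a rough illustration only, the sketch below uses plain soft voting: each model scores every answer option, the scores are turned into probabilities, and the probabilities are averaged. The function names (`softmax`, `ensemble_answer`) and the example scores are hypothetical, and soft voting is an assumption here, not necessarily the combination scheme used in ISAAQ.

```python
import math


def softmax(scores):
    """Convert raw per-option scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_answer(per_model_scores):
    """Soft-voting ensemble (an assumed scheme, not the paper's).

    per_model_scores: one list of raw scores per model, each with one
    score per answer option. Returns (index of best option, averaged
    probabilities).
    """
    probs = [softmax(scores) for scores in per_model_scores]
    n_options = len(probs[0])
    avg = [sum(p[i] for p in probs) / len(probs) for i in range(n_options)]
    best = max(range(n_options), key=lambda i: avg[i])
    return best, avg


# Hypothetical scores from three models for a four-option question.
example_scores = [
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 2.5, 1.0, 0.0],
]
best_option, averaged = ensemble_answer(example_scores)
```

In this made-up example the first model prefers option 0 while the other two prefer option 1, so averaging the probabilities selects option 1. A learned combiner over the model outputs would be an equally plausible reading of "ensembling"; the abstract alone does not settle this.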
