ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
Jose Manuel Gomez-Perez, Raul Ortega

TL;DR
This paper introduces ISAAQ, a system that combines pre-trained transformers with bottom-up and top-down attention mechanisms to improve multimodal reasoning in textbook question answering, achieving state-of-the-art results.
Contribution
ISAAQ is the first system to combine pre-trained transformers with bottom-up and top-down attention for multimodal textbook question answering.
Findings
Achieved 81.36% accuracy on true/false questions.
Attained 71.11% accuracy on text-only questions.
Reached 55.12% accuracy on diagram multiple choice questions.
Abstract
Textbook Question Answering is a complex task at the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps into the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails. Rather than training a language-visual transformer from scratch, we rely on pre-trained transformers, fine-tuning and ensembling. We add bottom-up and top-down attention to identify regions of interest corresponding to diagram constituents and their relationships, improving the selection of relevant visual information for each question and answer options. Our system ISAAQ reports unprecedented success in all TQA question types, with accuracies of 81.36%, 71.11% and 55.12% on true/false, text-only…
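To make the two mechanisms named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the bottom-up stage has already produced one feature vector per diagram region, that top-down attention scores each region by a dot product with a question-and-answer embedding (the original attention formulation may use a learned scoring network instead), and that ensembling is a simple average of per-option probabilities across models. The function names `top_down_attention` and `ensemble_predict` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_down_attention(query, regions):
    """Weight region features by their relevance to a query vector.

    query:   embedding of the question plus one answer option (assumed given).
    regions: one feature vector per region of interest, as produced by a
             bottom-up detector (assumed given).
    Scoring by dot product is an assumption made for this sketch.
    """
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, attended


def ensemble_predict(model_probs):
    """Average per-option probabilities from several models, pick the best.

    Probability averaging is an assumption; the abstract does not state
    how the fine-tuned models are combined.
    """
    n_models = len(model_probs)
    n_options = len(model_probs[0])
    averaged = [
        sum(probs[i] for probs in model_probs) / n_models
        for i in range(n_options)
    ]
    best = max(range(n_options), key=averaged.__getitem__)
    return best, averaged


if __name__ == "__main__":
    # Toy example: three regions with 2-d features, one query vector.
    weights, attended = top_down_attention(
        [1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
    )
    print(weights, attended)

    # Two models scoring two answer options.
    print(ensemble_predict([[0.6, 0.4], [0.2, 0.8]]))
```

In this toy setup the first region aligns with the query and therefore receives the largest attention weight, while the ensemble picks the option with the highest averaged probability.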
