MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering
Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai, Yudai Pan

TL;DR
This paper introduces MoCA, a novel multimodal model for Textbook Question Answering that leverages multi-stage domain pretraining and cross-guided attention to improve understanding of complex, domain-specific, multimodal inputs.
Contribution
MoCA combines multi-stage domain pretraining with a cross-guided multimodal attention mechanism to enhance TQA performance, addressing domain-specific terminology and complex multimodal fusion challenges.
Findings
Outperforms state-of-the-art methods by over 2% on validation and test sets.
Effective in handling domain-specific terminology and complex multimodal inputs.
Demonstrates significant improvements in TQA accuracy.
Abstract
Textbook Question Answering (TQA) is a complex multimodal task that requires inferring answers from large context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA contains a large amount of uncommon terminology and a wide variety of diagram inputs. This brings new challenges to the representation capability of language models for domain-specific spans, and it pushes multimodal fusion to a more complex level. To tackle these issues, we propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task. Firstly, we introduce a multi-stage domain pretraining module that conducts unsupervised post-pretraining with a span mask strategy, followed by supervised pre-finetuning. For domain post-pretraining in particular, we propose a heuristic generation algorithm to make use of the terminology corpus. Secondly, to fully…
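The span mask strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe the heuristic generation algorithm, so the greedy longest-match lookup below is only a hypothetical stand-in, and the names `find_term_spans`, `mask_spans`, `mask_token`, and `mask_prob` are assumptions made for illustration. The idea shown is that a multi-token domain term is masked as a whole unit rather than token by token.

```python
import random


def find_term_spans(tokens, terminology):
    """Return (start, end) index pairs of terminology matches in tokens.

    Hypothetical stand-in for a terminology-driven span finder:
    greedy, longest term first, exact token match.
    """
    # Split each term into tokens, skip empty terms, try longer terms first.
    terms = sorted(
        (t.split() for t in terminology if t.split()),
        key=len,
        reverse=True,
    )
    spans = []
    i = 0
    while i < len(tokens):
        for term in terms:
            n = len(term)
            if tokens[i:i + n] == term:
                spans.append((i, i + n))
                i += n  # jump past the matched span
                break
        else:
            i += 1  # no term starts here, move on
    return spans


def mask_spans(tokens, spans, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Mask each span as a whole unit with probability mask_prob.

    Returns the masked token list and a label list holding the original
    token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in spans:
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    tokens = "the cell membrane surrounds the cell".split()
    spans = find_term_spans(tokens, ["cell membrane", "cell"])
    # mask_prob=1.0 masks every matched span, which makes the demo deterministic.
    masked, labels = mask_spans(tokens, spans, mask_prob=1.0)
    print(masked)
    print(labels)
```

Masking at the span level forces the model to predict a whole term from its context, which is the usual motivation for span masking over independent token masking. A real pipeline would operate on subword tokens and a much larger terminology corpus.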
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
