M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing

TL;DR
This paper introduces M-LongDoc, a comprehensive benchmark and a retrieval-aware tuning framework for understanding and answering questions on multimodal, long, and diverse documents, improving model performance in this challenging setting.
Contribution
It presents what the authors describe as, to their knowledge, the first retrieval-aware tuning framework for multimodal long documents, together with a new benchmark of recent, lengthy documents paired with open-ended question-answering tasks.
Findings
Benchmark with 851 samples of lengthy multimodal documents.
Tuning approach yields a 4.6% relative improvement in answer correctness over baseline open-source models.
Provides open-source data, code, and models.
Abstract
The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to…
Taxonomy
Topics: Natural Language Processing Techniques
