Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering

Ao Zhou; Zebo Gu; Tenghao Sun; Jiawen Chen; Mingsheng Tu; Zifeng Cheng; Yafeng Yin; Zhiwei Jiang; Qing Gu

arXiv:2508.16148 · cs.IR · August 25, 2025

TL;DR

This paper introduces a hierarchical reasoning framework for understanding Japanese PDF documents under a ten-choice question format. It combines ColQwen-optimized retrieval with a semantic verification step based on sub-question decomposition, aiming to improve semantic parsing and robustness on long documents with complex layouts.

Contribution

The paper presents a hierarchical multimodal reasoning approach, combined with ColQwen-optimized retrieval and semantic verification through sub-question decomposition, for understanding Japanese PDF documents.

Findings

01. Significant improvement in deep semantic parsing of complex documents

02. Enhanced robustness in practical multimodal scenarios

03. Outperforms existing models on Japanese PDF understanding tasks

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models are strongly biased toward English training data, which leads to suboptimal performance in Japanese and other non-English scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with ColQwen-optimized retrieval methods, and introduces a semantic verification strategy based on sub-question decomposition. Experimental results demonstrate…
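
The abstract names ColQwen-optimized retrieval but does not say what the optimization is. ColQwen belongs to the ColPali family of late-interaction retrievers, which score a query against a page by matching each query token embedding to its best page patch embedding (MaxSim). The sketch below shows only that generic scoring idea on toy vectors. It is not the authors' pipeline, and every name in it (`maxsim_score`, `rank_pages`, the toy embeddings) is hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction (MaxSim) score.

    For each query vector, take its best dot product over all page
    vectors, then sum those maxima over the query vectors.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector],
    pages: Sequence[Sequence[Vector]],
    top_k: int = 3,
) -> List[int]:
    """Return indices of the top_k pages, highest MaxSim score first."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Toy data (assumed, for illustration only): 2 query vectors, 3 pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[0.9, 0.1], [0.2, 0.8]],  # page 0: matches both query vectors well
    [[0.1, 0.1], [0.0, 0.2]],  # page 1: weak match
    [[1.0, 0.0], [0.5, 0.5]],  # page 2: strong match on the first only
]

top_pages = rank_pages(query, pages, top_k=2)
```

In a real system the vectors would come from a vision-language encoder applied to page images and the question text. The pages selected this way would then be passed to the reasoning and verification stages the abstract describes.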
