Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports
Hayato Aida, Kosuke Takahashi, Takahiro Omi

TL;DR
This paper introduces a method that improves how large vision-language models understand tables in Japanese annual securities reports by adding layout and in-table text features as inputs, which raises question-answering accuracy.
Contribution
The study proposes augmenting Large Vision-Language Models (LVLMs) with layout and in-table text features, addressing their difficulty in accurately reading characters and their spatial relationships within complex documents.
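As a rough illustration of what supplying layout and in-table text alongside an image might look like, the sketch below serializes table cells and their bounding boxes into a text block that could accompany the table image in a prompt. This is not the authors' implementation: the `Cell` type, the 0 to 1000 coordinate grid, the tab-separated line format, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical container for one table cell: its text and pixel box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def normalize_box(cell, page_width, page_height, scale=1000):
    """Map pixel coordinates onto an integer 0..scale grid (an assumed convention)."""
    return (
        round(cell.x0 / page_width * scale),
        round(cell.y0 / page_height * scale),
        round(cell.x1 / page_width * scale),
        round(cell.y1 / page_height * scale),
    )


def serialize_layout(cells, page_width, page_height):
    """Emit one 'text<TAB>[x0,y0,x1,y1]' line per cell in reading order."""
    # Sort top-to-bottom, then left-to-right, using the top-left corner.
    ordered = sorted(cells, key=lambda c: (c.y0, c.x0))
    lines = []
    for cell in ordered:
        x0, y0, x1, y1 = normalize_box(cell, page_width, page_height)
        lines.append(f"{cell.text}\t[{x0},{y0},{x1},{y1}]")
    return "\n".join(lines)


def build_prompt(question, cells, page_width, page_height):
    """Combine the serialized layout with the question (assumed prompt wording)."""
    layout = serialize_layout(cells, page_width, page_height)
    return f"Table text with layout (text, box):\n{layout}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Two cells from a made-up 200x100 pixel table image.
    cells = [
        Cell("1,200", 100, 0, 200, 50),
        Cell("売上高", 0, 0, 100, 50),
    ]
    print(build_prompt("What is the value of 売上高?", cells, 200, 100))
```

In a setup like this, the resulting string would be passed to the model together with the table image. How the layout signal is actually encoded in the paper (as text, embeddings, or otherwise) is not stated in the summary above.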
Findings
Enhanced model performance on Japanese securities report tables
Improved accuracy in question answering over complex layouts
Robust interpretation without structured input formats
Abstract
With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats, including HTML, images, and plain text, making it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), which are major representatives of multimodal LLMs, still face challenges in accurately understanding characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by…
