Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

Hayato Aida; Kosuke Takahashi; Takahiro Omi

arXiv:2505.17625 · cs.CL · May 26, 2025

TL;DR

This paper introduces a method to improve large vision-language models for understanding tables in Japanese securities reports by incorporating layout and textual features, significantly enhancing question-answering accuracy.

Contribution

The study proposes a novel approach to augment LVLMs with layout and in-table text features, addressing current limitations in understanding complex document structures.
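As a rough illustration of what "layout and in-table text features" could look like as model input, the sketch below serializes table text together with normalized bounding boxes into a plain string. This is not the authors' method: the `Cell` structure, the `<box>` tag format, and the 0 to 1000 coordinate grid are all assumptions made for this example, and the paper may inject layout in an entirely different way (for instance as learned embeddings).

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One piece of in-table text with its pixel bounding box (hypothetical structure)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def normalize_box(cell: Cell, width: int, height: int, scale: int = 1000) -> tuple[int, int, int, int]:
    """Map pixel coordinates onto a 0..scale integer grid so boxes do not depend on image resolution."""
    return (
        cell.x0 * scale // width,
        cell.y0 * scale // height,
        cell.x1 * scale // width,
        cell.y1 * scale // height,
    )


def serialize_layout(cells: list[Cell], width: int, height: int) -> str:
    """Render each cell as 'text <box>x0,y0,x1,y1</box>', ordered top to bottom, then left to right."""
    ordered = sorted(cells, key=lambda c: (c.y0, c.x0))
    lines = []
    for cell in ordered:
        x0, y0, x1, y1 = normalize_box(cell, width, height)
        # The <box> tag is an assumed, illustrative format, not one defined by the paper.
        lines.append(f"{cell.text} <box>{x0},{y0},{x1},{y1}</box>")
    return "\n".join(lines)


# Toy cells, deliberately given out of reading order.
cells = [
    Cell("1,200", 400, 100, 500, 120),
    Cell("売上高", 0, 100, 200, 120),
    Cell("項目", 0, 0, 200, 20),
]
prompt_layout = serialize_layout(cells, width=1000, height=500)
```

A string like `prompt_layout` could be placed next to the table image and the question in an LVLM prompt, so the model sees both the characters and where they sit on the page.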

Findings

1. Enhanced model performance on Japanese securities report tables
2. Improved accuracy in question answering over complex layouts
3. Robust interpretation without structured input formats

Abstract

With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats (including HTML, images, and plain text), which makes it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), a major class of multimodal LLMs, still struggle to accurately recognize characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by…

Code & Models

Datasets

stockmark/u4-table-cell-qa
dataset · 104 downloads
