Multimodal Table Understanding

Mingyu Zheng; Xinwei Feng; Qingyi Si; Qiaoqiao She; Zheng Lin; Wenbin Jiang; Weiping Wang

arXiv:2406.08100·cs.CL·June 13, 2024



TL;DR

This paper introduces multimodal table understanding, in which models interpret table images directly, and presents a new dataset (MMTab) and model (Table-LLaVA) that outperform recent open-source MLLMs on 23 benchmarks.

Contribution

It proposes the first approach to directly understand tables from images, along with a large-scale dataset and a multimodal large language model for tables.

Findings

01. Table-LLaVA outperforms recent open-source MLLMs on 23 benchmarks.

02. The MMTab dataset covers diverse table images, instructions, and tasks.

03. The approach enables practical table understanding without relying on textual conversions.

Abstract

Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access such high-quality textual table representations in some real-world scenarios, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications. In this paper, we propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests based on the given table image. To facilitate both the model training and evaluation, we construct a large-scale dataset named…
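To make the premise in the abstract concrete, the following is a minimal sketch of what "converting a table into a text sequence" looks like for the two formats the abstract names, Markdown and HTML. The helper names `to_markdown` and `to_html` and the toy table are hypothetical illustrations, not code or data from the paper or its repository.

```python
from html import escape


def to_markdown(header, rows):
    """Serialize a table as a Markdown pipe table (hypothetical helper)."""
    lines = [
        "| " + " | ".join(str(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize the same table as a flat HTML string (hypothetical helper)."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Toy table, invented purely for illustration.
header = ["Item", "Qty"]
rows = [["apple", 3], ["pear", 5]]

markdown_text = to_markdown(header, rows)
html_text = to_html(header, rows)

print(markdown_text)
print(html_text)
```

A text-only LLM pipeline would consume strings like `markdown_text` or `html_text`. In the multimodal setting described above, the model instead receives an image of the table, so this serialization step (and the clean source data it requires) is not needed.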


Code & Models

Repositories

spursgozmy/table-llava (PyTorch, official)


Taxonomy

Topics: Handwritten Text Recognition Techniques