Multimodal Table Understanding
Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, Weiping Wang

TL;DR
This paper introduces multimodal table understanding, in which models interpret table images directly, and presents a new dataset (MMTab) and a model (Table-LLaVA) that outperforms recent open-source MLLMs on 23 benchmarks.
Contribution
It proposes the new problem of multimodal table understanding, in which tables are understood directly from images, along with a large-scale dataset and a multimodal large language model for tables.
Findings
Table-LLaVA outperforms recent open-source MLLMs on 23 benchmarks.
The MMTab dataset covers diverse table images, instructions, and tasks.
The approach enables practical table understanding without relying on textual conversions.
Abstract
Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access such high-quality textual table representations in some real-world scenarios, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications. In this paper, we propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests based on the given table image. To facilitate both the model training and evaluation, we construct a large-scale dataset named…
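To make the abstract's premise concrete, the following is a minimal sketch of the text conversion step that previous methods depend on: a table is serialized to Markdown or HTML before it is given to a language model. The function names `table_to_markdown` and `table_to_html` and the toy table are hypothetical illustrations, not code or data from the paper.

```python
from html import escape


def table_to_markdown(header, rows):
    """Serialize a table as a Markdown pipe table (illustrative helper)."""
    lines = [
        "| " + " | ".join(str(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def table_to_html(header, rows):
    """Serialize the same table as a compact HTML <table> string."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Toy table, invented purely for illustration.
header = ["item", "count"]
rows = [["apple", 3], ["pear", 5]]

markdown_text = table_to_markdown(header, rows)
html_text = table_to_html(header, rows)

print(markdown_text)
print(html_text)
```

A text-based pipeline would pass `markdown_text` or `html_text` to the model as input. In the multimodal setting the paper proposes, this serialization step is unavailable and the model receives only an image of the table.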
Taxonomy
Topics: Handwritten Text Recognition Techniques
