Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities

Shiyu Xia; Junyu Xiong; Haoyu Dong; Jianbo Zhao; Yuzhang Tian; Mengyu Zhou; Yeye He; Shi Han; Dongmei Zhang

arXiv:2405.16234·cs.CV·September 27, 2024

TL;DR

This paper evaluates Vision Language Models on spreadsheet understanding tasks, identifying their strengths in OCR and weaknesses in spatial and format recognition, and proposes challenges and evaluation metrics for comprehensive assessment.

Contribution

It introduces three self-supervised challenges with metrics for evaluating VLMs on spreadsheet comprehension and proposes new settings and prompts to probe their capabilities.

Findings

01 VLMs show promising OCR performance.

02 VLMs have poor spatial and format recognition.

03 Proposed methods generate diverse spreadsheet-image pairs.

Abstract

This paper explores capabilities of Vision Language Models on spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. Our findings reveal that VLMs…
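
The boundary-decoding idea in the abstract can be illustrated with a minimal sketch. It assumes a spreadsheet is available as a 2D list of cell strings and that a table is described by zero-based, inclusive row and column indices. The function names (`boundary_values`, `column_letter`, `a1_range`) and the dictionary output are hypothetical choices made for this illustration; the abstract does not specify the paper's actual target format or prompts.

```python
from typing import Dict, List

Grid = List[List[str]]


def boundary_values(grid: Grid, top: int, left: int,
                    bottom: int, right: int) -> Dict[str, List[str]]:
    """Collect the cell values lying on the four edges of a table.

    Indices are zero-based and inclusive (an assumption of this sketch).
    These text values are what a model could be asked to decode in place
    of numeric coordinates.
    """
    return {
        "top": grid[top][left:right + 1],
        "bottom": grid[bottom][left:right + 1],
        "left": [grid[r][left] for r in range(top, bottom + 1)],
        "right": [grid[r][right] for r in range(top, bottom + 1)],
    }


def column_letter(index: int) -> str:
    """Convert a zero-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1  # switch to one-based for the bijective base-26 conversion
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def a1_range(top: int, left: int, bottom: int, right: int) -> str:
    """Render the same table region as an A1-style range, e.g. B2:D4."""
    return f"{column_letter(left)}{top + 1}:{column_letter(right)}{bottom + 1}"
```

For a table occupying rows 1 to 3 and columns 1 to 3 of a grid, `boundary_values` returns the header row, the last row, and the first and last columns as text, while `a1_range(1, 1, 3, 3)` gives the conventional coordinate form of the same region. Comparing the two outputs shows the contrast the abstract draws between reading text and locating positions in two dimensions.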

Taxonomy

Topics: Spreadsheets and End-User Computing