Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities
Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, Dongmei Zhang

TL;DR
This paper evaluates Vision Language Models on spreadsheet understanding tasks, identifying their strengths in OCR and weaknesses in spatial and format recognition, and proposes challenges and evaluation metrics for comprehensive assessment.
Contribution
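It introduces three self-supervised challenges, each with its own metric, for evaluating VLMs on spreadsheet comprehension (OCR, spatial perception, and visual format recognition). It also proposes three spreadsheet-to-image settings (column width adjustment, style change, and address augmentation) and prompt variants to probe their capabilities.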
Findings
VLMs show promising OCR performance.
VLMs have poor spatial and format recognition.
Proposed methods generate diverse spreadsheet-image pairs.
Abstract
This paper explores capabilities of Vision Language Models on spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. Our findings reveal that VLMs…
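The boundary-decoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the grid layout, the 0-based inclusive range convention, and the helper names `column_letter`, `range_address`, and `boundary_values` are all assumptions made for illustration. The sketch only shows what "cell values on the four boundaries of the table" might look like as a text target, alongside the A1-style address that an address-augmented image would expose.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory sheet: a list of rows, None for empty cells.
Grid = List[List[Optional[str]]]


def column_letter(index: int) -> str:
    """Convert a 0-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1  # switch to 1-based for the bijective base-26 conversion
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def range_address(top: int, left: int, bottom: int, right: int) -> str:
    """Format an inclusive 0-based range as an A1-style address such as B2:C4."""
    return f"{column_letter(left)}{top + 1}:{column_letter(right)}{bottom + 1}"


def boundary_values(
    grid: Grid, top: int, left: int, bottom: int, right: int
) -> Dict[str, List[Optional[str]]]:
    """Collect the cell values lying on the four edges of a table range."""
    return {
        "top": [grid[top][c] for c in range(left, right + 1)],
        "bottom": [grid[bottom][c] for c in range(left, right + 1)],
        "left": [grid[r][left] for r in range(top, bottom + 1)],
        "right": [grid[r][right] for r in range(top, bottom + 1)],
    }


# Toy sheet (assumed data): a 3x2 table surrounded by empty cells.
grid: Grid = [
    [None, None, None, None],
    [None, "Name", "Qty", None],
    [None, "Pen", "3", None],
    [None, "Ink", "7", None],
]

edges = boundary_values(grid, top=1, left=1, bottom=3, right=2)
address = range_address(top=1, left=1, bottom=3, right=2)
```

In this toy case the table occupies `B2:C4`, and the four edge lists are text a model could read off the image instead of predicting coordinates directly. How the paper actually maps decoded values back to a range is not stated in the abstract and is left open here.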
Taxonomy
Topics: Spreadsheets and End-User Computing
