TL;DR
This paper introduces Mondrian, an automated method for detecting layout templates in complex spreadsheets with multiple regions, improving the identification of recurring structures across files.
Contribution
The paper presents a three-phase approach that combines image rendering, clustering, and graph comparison to identify layout templates in multiregion spreadsheets, and reports that it outperforms state-of-the-art table recognition algorithms.
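To make the notion of a "multiregion" file concrete, here is a minimal sketch of how independent regions separated by empty cells could be located in a sheet. It is not Mondrian's method: the paper renders files as images and clusters elements, whereas this illustration swaps in a plain connected-component search over non-empty cells. The function name `find_regions` and the sample `sheet` are hypothetical and exist only for this example.

```python
from collections import deque


def find_regions(grid):
    """Return bounding boxes (top, left, bottom, right) of connected
    groups of non-empty cells. Naive baseline, NOT the Mondrian algorithm."""
    rows = len(grid)
    cols = max((len(r) for r in grid), default=0)

    def filled(r, c):
        # Short rows are treated as padded with empty cells.
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search over the 8 neighbouring cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and (ny, nx) not in seen and filled(ny, nx)):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical sheet: two side-by-side tables and a stray note below them.
sheet = [
    ["Year", "Sales", None, "Region", "Head"],
    [2020, 10, None, "EU", "Ann"],
    [None, None, None, None, None],
    ["Notes", None, None, None, None],
]
print(find_regions(sheet))
```

A baseline like this splits a region wherever a single empty row or column occurs, even inside one logical table, which is one reason a more tolerant clustering step is useful on real spreadsheets.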
Findings
Effective detection of region boundaries within files.
Successful identification of recurring layouts across multiple spreadsheets.
Outperforms state-of-the-art table recognition algorithms.
Abstract
Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a…
