SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering
Liang Xu, Wensheng Dou, Chushu Gao, Jie Wang, Jun Wei, Hua Zhong, Tao Huang

TL;DR
SpreadCluster is an automated clustering method that groups versioned spreadsheets based on learned similarity features, significantly improving accuracy over filename-based methods and enabling the creation of larger, more comprehensive versioned spreadsheet corpora.
Contribution
The paper introduces SpreadCluster, a novel clustering algorithm that automatically identifies versioned spreadsheets using feature similarity, outperforming existing filename-based approaches.
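To make the idea of clustering by feature similarity concrete, here is a minimal sketch. It is not the paper's algorithm: the feature sets (column-header strings), the Jaccard measure, the fixed threshold, and the names `jaccard` and `cluster_by_similarity` are all assumptions chosen for illustration, since this summary does not say which features SpreadCluster learns or how it compares them.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(
    features: dict[str, set[str]], threshold: float
) -> list[set[str]]:
    """Group spreadsheets whose feature sets are similar enough.

    Two spreadsheets are linked when their Jaccard similarity is at least
    `threshold`; clusters are the connected components of those links.
    """
    parent = {name: name for name in features}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(features, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in features:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical feature sets (assumed to be column headers). The filenames
# carry no hint that a.xls and b.xls are versions of the same spreadsheet.
sheets = {
    "a.xls": {"Date", "Volume", "Price", "Total"},
    "b.xls": {"Date", "Volume", "Price", "Total", "Notes"},
    "c.xls": {"Employee", "Salary", "Dept"},
}
clusters = cluster_by_similarity(sheets, threshold=0.6)
```

In this toy input, `a.xls` and `b.xls` share four of five distinct headers and end up in one group, while `c.xls` stays alone. A filename-based method would have nothing to go on here, which is the gap content-based similarity is meant to close.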
Findings
SpreadCluster achieves higher precision and recall than filename-based clustering.
It successfully clusters spreadsheets across multiple corpora, including VEnron, FUSE, and EUSES.
The resulting VEnron2 corpus is significantly larger and more comprehensive than the original VEnron.
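For readers unfamiliar with how precision and recall apply to clusterings, one common convention is to score pairs of items: a pair counts as a hit when both the predicted and the reference clustering put the two items in the same group. The sketch below uses that pairwise convention as an assumption; this summary does not state which definition the paper uses, and the function names and sample data are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Return every unordered pair of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of `predicted` against `truth`."""
    pred = same_cluster_pairs(predicted)
    gold = same_cluster_pairs(truth)
    hits = len(pred & gold)
    # With no pairs on one side, treat the corresponding score as perfect.
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical data: the reference says a, b, c are versions of one
# spreadsheet; the prediction splits c off and wrongly pairs it with d.
truth = [{"a", "b", "c"}, {"d"}]
predicted = [{"a", "b"}, {"c", "d"}]
precision, recall = pairwise_precision_recall(predicted, truth)
```

Here the prediction proposes two pairs, of which only (a, b) is correct, and the reference contains three pairs, of which only (a, b) is recovered.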
Abstract
Version information plays an important role in spreadsheet understanding, maintenance, and quality improvement. However, end users rarely use version control tools to document spreadsheet version information. As a result, the version information is missing, and different versions of a spreadsheet coexist as individual, similar spreadsheets. Existing approaches try to recover spreadsheet version information by clustering these similar spreadsheets based on spreadsheet filenames or related email conversations. However, the applicability and accuracy of existing clustering approaches are limited because the necessary information (e.g., filenames and email conversations) is usually missing. We inspected the versioned spreadsheets in VEnron, which is extracted from the Enron Corporation. In VEnron, the different versions of a spreadsheet are clustered into an evolution group. We…
Taxonomy
Topics: Spreadsheets and End-User Computing
