SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering

Liang Xu; Wensheng Dou; Chushu Gao; Jie Wang; Jun Wei; Hua Zhong; Tao Huang

arXiv:1704.08476·cs.SE·April 28, 2017·1 cites

TL;DR

SpreadCluster is an automated clustering method that groups versioned spreadsheets based on learned similarity features, significantly improving accuracy over filename-based methods and enabling the creation of larger, more comprehensive versioned spreadsheet corpora.

Contribution

The paper introduces SpreadCluster, a novel clustering algorithm that automatically identifies versioned spreadsheets using feature similarity, outperforming existing filename-based approaches.
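
As a rough illustration of grouping spreadsheets by feature similarity rather than by filename, here is a minimal Python sketch. It is not the SpreadCluster algorithm: the feature (a set of header labels per spreadsheet), the Jaccard measure, the greedy grouping, the 0.6 threshold, and the file names are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two label sets; two empty sets count as 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_similarity(
    features: Dict[str, Set[str]], threshold: float
) -> List[List[str]]:
    """Greedy single-pass grouping (an illustrative choice, not the paper's).

    Each spreadsheet joins the first cluster that already contains a member
    whose similarity is at least `threshold`; otherwise it starts a new one.
    """
    clusters: List[List[str]] = []
    for name in sorted(features):
        for cluster in clusters:
            if any(
                jaccard(features[name], features[member]) >= threshold
                for member in cluster
            ):
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([name])
    return clusters


# Hypothetical spreadsheets, each described by its header labels.
sheets = {
    "a.xls": {"Date", "Volume", "Price", "Total"},
    "b.xls": {"Date", "Volume", "Price", "Total", "Notes"},
    "c.xls": {"Employee", "Salary", "Department"},
}
groups = cluster_by_similarity(sheets, threshold=0.6)
print(groups)
```

Here `a.xls` and `b.xls` share four of five distinct labels, so they land in one group even though their filenames have nothing in common, while `c.xls` stays separate.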

Findings

01 SpreadCluster achieves higher precision and recall than filename-based clustering.

02 It successfully clusters spreadsheets across multiple corpora, including VEnron, FUSE, and EUSES.

03 The resulting VEnron2 corpus is significantly larger and more comprehensive.
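
The first finding compares precision and recall between clustering methods. One common way to score a clustering against a reference grouping is pairwise: count the pairs of items placed in the same cluster. The sketch below assumes that pairwise definition, which may differ from the one used in the paper, and the item names and groupings are made up.

```python
from itertools import combinations
from typing import List, Set, Tuple


def same_cluster_pairs(clusters: List[List[str]]) -> Set[Tuple[str, str]]:
    """Return every unordered pair of items that share a cluster."""
    pairs: Set[Tuple[str, str]] = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, e.g. always ("a", "b").
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pairwise_precision_recall(
    predicted: List[List[str]], truth: List[List[str]]
) -> Tuple[float, float]:
    """Pairwise precision and recall of `predicted` against `truth`."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical example: four spreadsheets, two candidate groupings.
predicted = [["a", "b", "c"], ["d"]]
truth = [["a", "b"], ["c", "d"]]
precision, recall = pairwise_precision_recall(predicted, truth)
print(precision, recall)
```

In this example the predicted grouping makes three same-cluster pairs, of which one is correct, and the reference has two pairs, so precision is 1/3 and recall is 1/2.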

Abstract

Version information plays an important role in spreadsheet understanding, maintenance, and quality improvement. However, end users rarely use version control tools to document spreadsheet version information. As a result, the version information is missing, and different versions of a spreadsheet coexist as individual, similar spreadsheets. Existing approaches try to recover spreadsheet version information by clustering these similar spreadsheets based on spreadsheet filenames or related email conversations. However, the applicability and accuracy of existing clustering approaches are limited because the necessary information (e.g., filenames and email conversations) is usually missing. We inspected the versioned spreadsheets in VEnron, which is extracted from the Enron Corporation. In VEnron, the different versions of a spreadsheet are clustered into an evolution group. We…

Taxonomy

Topics: Spreadsheets and End-User Computing