Towards Cross-Table Masked Pretraining for Web Data Mining
Chao Ye, Guoshan Lu, Haobo Wang, Liyao Li, Sai Wu, Gang Chen, Junbo Zhao

TL;DR
This paper introduces a novel cross-table pretraining framework called CM2 for web data mining, addressing the challenge of heterogeneous tabular data and demonstrating improved performance on downstream tasks.
Contribution
It presents a new dataset, a semantic-aware neural network, and a tailored pretraining objective for cross-table tabular data modeling.
Findings
CM2 achieves state-of-the-art results on multiple tasks.
Cross-table pretraining enhances downstream task performance.
The proposed framework effectively encodes heterogeneous tables.
Abstract
Tabular data pervades the landscape of the World Wide Web, playing a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models like ChatGPT and SAM across various domains, exploring the application of pretraining techniques for mining tabular data on the web has emerged as a highly promising research direction. Indeed, there have been some recent works around this topic where most (if not all) of them are limited in the scope of a fixed-schema/single table. Due to the scale of the dataset and the parameter size of the prior models, we believe that we have not reached the "BERT moment" for the ubiquitous tabular data. The development on this line significantly lags behind the counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind…
Taxonomy
Topics: Data Quality and Management · Text Readability and Simplification · Topic Modeling
Methods: Segment Anything Model
