Towards Cross-Table Masked Pretraining for Web Data Mining

Chao Ye; Guoshan Lu; Haobo Wang; Liyao Li; Sai Wu; Gang Chen; Junbo Zhao

arXiv:2307.04308 · cs.LG · February 2, 2024 · 1 citation

TL;DR

This paper introduces a novel cross-table pretraining framework called CM2 for web data mining, addressing the challenge of heterogeneous tabular data and demonstrating improved performance on downstream tasks.

Contribution

It presents a new dataset, a semantic-aware neural network, and a tailored pretraining objective for cross-table tabular data modeling.

Findings

1. CM2 achieves state-of-the-art results on multiple tasks.
2. Cross-table pretraining enhances downstream task performance.
3. The proposed framework effectively encodes heterogeneous tables.

Abstract

Tabular data pervades the landscape of the World Wide Web, playing a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models like ChatGPT and SAM across various domains, exploring the application of pretraining techniques for mining tabular data on the web has emerged as a highly promising research direction. Indeed, there have been some recent works on this topic, but most (if not all) of them are limited to a fixed-schema or single-table setting. Given the scale of the datasets and the parameter sizes of prior models, we believe that we have not yet reached the "BERT moment" for ubiquitous tabular data. Development along this line significantly lags behind counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind…

Taxonomy

Topics: Data Quality and Management · Text Readability and Simplification · Topic Modeling

Methods: Segment Anything Model