LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation
Yushi Sun, Xujia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen

TL;DR
LakeHopper is a framework that adapts pre-trained language models for column type annotation across different data lakes, reducing the need for extensive new annotations through model interaction, data selection, and incremental fine-tuning.
Contribution
We introduce LakeHopper, a novel approach for transferring pre-trained models to new data lakes with minimal annotations, addressing knowledge gaps and optimizing fine-tuning.
Findings
Effective transfer of models across data lakes demonstrated
Reduces annotation effort in new data lake environments
Improves accuracy with incremental fine-tuning
Abstract
Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models (LMs) fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake to minimize the annotations required on the new data lake. However, several challenges exist: bridging the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate…
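To make the idea of cluster-based data selection concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each unannotated target column has already been mapped to a numeric embedding (how that is done is not specified here), uses plain k-means as a stand-in clustering method, and picks the column nearest each cluster centroid as a candidate for annotation. The names `kmeans` and `select_representatives` are hypothetical.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; deterministic init from the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assignment


def select_representatives(embeddings, k):
    """Return, per non-empty cluster, the index of the column closest to the centroid."""
    centroids, assignment = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: dist(embeddings[i], centroids[c])))
    return sorted(chosen)


# Hypothetical 2-D column embeddings forming two well-separated groups.
column_embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
]
to_annotate = select_representatives(column_embeddings, k=2)
```

In this toy setup, one column from each group is selected, so an annotation budget of two labels would cover both groups. The selection criterion actually used by LakeHopper may differ from nearest-to-centroid.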
Taxonomy
Topics: Data Quality and Management · Natural Language Processing Techniques · Topic Modeling
