Auto-Tag: Tagging-Data-By-Example in Data Lakes
Yeye He, Jie Song, Yue Wang, Surajit Chaudhuri, Vishal Anil, Blake Lassiter, Yaron Goland, Gaurav Malhotra

TL;DR
Auto-Tag is a lightweight, interactive approach that automatically tags custom data types in enterprise data lakes using minimal user input and an offline index, enhancing data governance and search capabilities.
Contribution
The paper introduces Auto-Tag, a corpus-driven method that infers data patterns from a single example column to tag custom data types in large-scale data lakes.
Findings
Auto-Tag achieves high accuracy in data-type tagging.
The approach is efficient and suitable for large enterprise data lakes.
Auto-Tag integrates with Azure Purview for practical deployment.
Abstract
As data lakes become increasingly popular in large enterprises today, there is a growing need to tag or classify data assets (e.g., files and databases) in data lakes with additional metadata (e.g., semantic column-types), as the inferred metadata can enable a range of downstream applications like data governance (e.g., GDPR compliance), and dataset search. Given the sheer size of today's enterprise data lakes with petabytes of data and millions of data assets, it is imperative that data assets can be "auto-tagged", using lightweight inference algorithms and minimal user input. In this work, we develop Auto-Tag, a corpus-driven approach that automates data-tagging of custom data types in enterprise data lakes. Using Auto-Tag, users only need to provide one example column to demonstrate the desired data-type to tag. Leveraging an index structure built offline using a…
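To make the "one example column" idea concrete, here is a minimal sketch of tagging by example. It is not the Auto-Tag algorithm: the abstract describes a corpus-driven method backed by an offline index, whereas this sketch simply generalizes each example value into a coarse regular expression and tags a candidate column when enough of its values match. The function names (`generalize`, `infer_pattern`, `matches_tag`) and the 0.8 threshold are assumptions made for illustration.

```python
import re
from collections import Counter

# Tokens that stand for a run of characters of one class (illustrative choice).
CLASS_TOKENS = (r"\d+", "[A-Za-z]+")


def generalize(value: str) -> str:
    """Turn a value into a coarse regex: digit runs and letter runs become
    class tokens, every other character is kept as an escaped literal."""
    parts = []
    for ch in value:
        if ch.isascii() and ch.isdigit():
            token = r"\d+"
        elif ch.isascii() and ch.isalpha():
            token = "[A-Za-z]+"
        else:
            token = re.escape(ch)
        # Collapse consecutive characters of the same class into one token.
        if token in CLASS_TOKENS and parts and parts[-1] == token:
            continue
        parts.append(token)
    return "".join(parts)


def infer_pattern(example_column: list[str]) -> str:
    """Pick the most frequent generalized pattern in the example column."""
    counts = Counter(generalize(v) for v in example_column)
    return counts.most_common(1)[0][0]


def matches_tag(pattern: str, column: list[str], threshold: float = 0.8) -> bool:
    """Tag a column if at least `threshold` of its values fully match the pattern.
    The threshold is a hypothetical parameter, not a value from the paper."""
    if not column:
        return False
    hits = sum(1 for v in column if re.fullmatch(pattern, v))
    return hits / len(column) >= threshold


example = ["AB-1234", "CD-5678", "EF-0001"]
pattern = infer_pattern(example)
print(pattern)
print(matches_tag(pattern, ["ZZ-42", "QQ-7"]))
print(matches_tag(pattern, ["ZZ-42", "QQ-7", "hello"]))
```

A fixed generalization rule like this one can easily be too broad or too narrow for a given data type; choosing the right level of generalization is the kind of decision the abstract attributes to the corpus-driven index.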
Taxonomy
Topics: Data Quality and Management · Big Data and Business Intelligence · Cloud Data Security Solutions
