BEACON: Budget-Aware Entity Matching Across Domains (Extended Technical Report)
Nicholas Pulsone, Roee Shraga, Gregory Goren

TL;DR
BEACON is a framework for entity matching in low-resource domains: it selects out-of-domain training samples to supplement limited in-domain labels, and is reported to outperform existing methods.
Contribution
The paper introduces BEACON, a distribution-aware, budget-aware approach for low-resource entity matching that uses embedding representations to guide sample selection across domains.
Findings
BEACON outperforms state-of-the-art methods across multiple datasets.
It combines limited in-domain labels with labeled out-of-domain data.
The approach is robust under various training budgets.
Abstract
Entity Matching (EM), the task of determining whether two data records refer to the same real-world entity, is a core task in data integration. Recent advances in deep learning have set a new standard for EM, particularly through fine-tuning Pretrained Language Models (PLMs) and, more recently, Large Language Models (LLMs). However, fine-tuning typically requires large amounts of labeled data, which are expensive and time-consuming to obtain. In the context of e-commerce matching, labeling scarcity varies widely across domains, raising the question of how to intelligently train accurate domain-specific EM models with limited labeled data. In this work, we assume users have only a limited amount of labels for a specific target domain but have access to labeled data from other domains. We introduce BEACON, a distribution-aware, budget-aware framework for low-resource EM across domains.…
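To make the idea of embedding-guided, budget-limited sample selection concrete, here is a minimal sketch. It is not BEACON's actual selection rule, which the truncated abstract does not specify. It assumes one simple heuristic: rank out-of-domain samples by cosine similarity to the centroid of the in-domain embeddings and keep the top `budget`. The function names (`cosine`, `centroid`, `select_out_of_domain`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_out_of_domain(target_embeddings, source_embeddings, budget):
    """Return indices of up to `budget` source samples closest to the target centroid.

    ASSUMPTION: centroid-based cosine ranking stands in for the paper's
    distribution-aware criterion, which is not described in the abstract.
    """
    if budget <= 0 or not source_embeddings:
        return []
    center = centroid(target_embeddings)
    ranked = sorted(
        range(len(source_embeddings)),
        key=lambda i: cosine(source_embeddings[i], center),
        reverse=True,  # most similar first
    )
    return ranked[:budget]


# Toy example: two in-domain embeddings, four out-of-domain candidates.
target = [[1.0, 0.0], [0.9, 0.1]]
source = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
selected = select_out_of_domain(target, source, budget=2)
print(selected)
```

In practice the embeddings would come from a pretrained encoder applied to serialized record pairs, and the selected out-of-domain samples would be added to the limited in-domain labels before fine-tuning.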
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Advanced Graph Neural Networks
