Machamp: A Generalized Entity Matching Benchmark
Jin Wang, Yuliang Li, Wataru Hirota

TL;DR
Machamp is a benchmark for generalized entity matching across structured, semi-structured, and unstructured data. It allows matching techniques to be evaluated beyond the traditional setting of structured tables, supporting research on real-world data integration.
Contribution
The paper presents Machamp, a new benchmark with seven diverse tasks for generalized entity matching across various data structures, filling a gap in existing EM evaluation methods.
Findings
Existing EM benchmarks are limited to pairs of structured tables that share the same schema.
Machamp enables evaluation across structured, semi-structured, and unstructured data.
Popular EM approaches are evaluated on Machamp, revealing their strengths and limitations.
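
To make the structured / semi-structured / unstructured distinction concrete, here is a minimal sketch of a matcher that accepts all three kinds of record. It is not one of the approaches evaluated in the paper. It is a toy token-overlap (Jaccard) baseline, and the records, function names, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re
from typing import Any, Union

# A record is either a (possibly nested) dict or a raw text string.
Record = Union[dict, str]


def flatten(value: Any) -> list[str]:
    """Collect every leaf value of a (possibly nested) record as strings."""
    if isinstance(value, dict):
        return [leaf for v in value.values() for leaf in flatten(v)]
    if isinstance(value, (list, tuple)):
        return [leaf for v in value for leaf in flatten(v)]
    return [str(value)]


def tokens(record: Record) -> set[str]:
    """Lowercased alphanumeric tokens over all leaf values (keys are ignored)."""
    text = " ".join(flatten(record)).lower()
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a: Record, b: Record) -> float:
    """Token-set Jaccard similarity; defined as 0.0 when both sides are empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_match(a: Record, b: Record, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: match when similarity reaches the threshold."""
    return jaccard(a, b) >= threshold


# Hypothetical records describing the same product in three formats.
structured = {"title": "Acme Espresso Machine X100", "brand": "Acme", "price": 199}
semi_structured = {
    "name": "Acme X100",
    "specs": {"type": "espresso machine", "colors": ["black", "silver"]},
}
unstructured = "The Acme X100 espresso machine in black"
unrelated = "Globex laser printer 4200 in white"

for left, right in [
    (structured, semi_structured),
    (semi_structured, unstructured),
    (structured, unrelated),
]:
    print(round(jaccard(left, right), 3), is_match(left, right))
```

Because every record is reduced to a bag of tokens, the two sides need not share a schema, which is the setting generalized entity matching targets. The same reduction discards attribute names and nesting, so a real matcher would likely need a richer serialization.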
Abstract
Entity Matching (EM) refers to the problem of determining whether two different data representations refer to the same real-world entity. It has been a long-standing interest of the data management community, and much effort has gone into creating benchmark tasks as well as developing advanced matching techniques. However, existing benchmark tasks for EM are limited to the case where the two data collections of entities are structured tables with the same schema. Meanwhile, in real-world data science scenarios, the data collections to be matched can be structured, semi-structured, or unstructured. In this paper, we propose a new research problem, Generalized Entity Matching, to meet this requirement, and create a benchmark, Machamp, for it. Machamp consists of seven tasks with diverse characteristics and thus provides good coverage of use cases in real applications. We…
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Web Data Mining and Analysis
