A Tight Expressivity Hierarchy for GNN-Based Entity Resolution in Master Data Management
Ashwin Ganesan

TL;DR
This paper establishes a theoretical hierarchy for the complexity of GNN architectures in entity resolution, identifying minimal neural network structures needed for different resolution tasks.
Contribution
It introduces a formal separation theory with tight bounds for GNN capabilities in entity resolution, guiding practitioners on minimal architecture requirements.
Findings
Detecting shared attributes is a local problem requiring 2-layer reverse message passing.
Detecting multiple shared attributes or cycles requires 4-layer ego ID mechanisms.
The results provide a minimal-architecture principle for efficient GNN design.
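The gap between the first two findings can be illustrated with a small sketch. It is not the paper's construction: it is plain Python that mimics two rounds of anonymous count aggregation (entity to attribute, then back along the reverse edges), and every name in it (`two_round_signal`, `shares_some_attribute`, the toy graphs `split` and `joint`) is hypothetical. All entities are assumed to have the same type.

```python
from collections import defaultdict

def two_round_signal(edges):
    """edges: iterable of (entity, attribute) pairs; all entities assumed same-type."""
    edges = set(edges)
    # Round 1 (entity -> attribute): each attribute counts its incident entities.
    attr_count = defaultdict(int)
    for _entity, attr in edges:
        attr_count[attr] += 1
    # Round 2 (attribute -> entity, the reverse direction): each entity collects
    # the multiset of counts reported by its own attributes.
    signal = defaultdict(list)
    for entity, attr in edges:
        signal[entity].append(attr_count[attr])
    return {entity: sorted(counts) for entity, counts in signal.items()}

def shares_some_attribute(edges, entity):
    # An entity shares an attribute with another entity exactly when one of
    # its attributes is incident to at least two entities.
    return any(count >= 2 for count in two_round_signal(edges)[entity])

# e1 shares "a" with e2 and "b" with e3: no single entity shares both.
split = [("e1", "a"), ("e1", "b"), ("e2", "a"), ("e3", "b")]
# e1 shares both "a" and "b" with e2.
joint = [("e1", "a"), ("e1", "b"), ("e2", "a"), ("e2", "b")]

print(shares_some_attribute(split, "e1"), shares_some_attribute(joint, "e1"))
print(two_round_signal(split)["e1"] == two_round_signal(joint)["e1"])
```

In both toy graphs `e1` receives the same two-round signal, although only `joint` contains a pair sharing two attributes. Anonymous counts answer "is any attribute shared?" but cannot tell whether the sharing comes from one entity or several, which is consistent with the finding that the multi-attribute case needs a stronger mechanism such as ego IDs.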
Abstract
Entity resolution -- identifying database records that refer to the same real-world entity -- is naturally modelled on bipartite graphs connecting entity nodes to their attribute values. Applying a message-passing neural network (MPNN) with all available extensions (reverse message passing, port numbering, ego IDs) incurs unnecessary overhead, since different entity resolution tasks have fundamentally different complexity. For a given matching criterion, what is the cheapest MPNN architecture that provably works? We answer this with a four-theorem separation theory on typed entity-attribute graphs. We introduce co-reference predicates (two same-type entities share at least a given number of attribute values) and a cycle predicate, parameterized by cycle length, for settings with entity-entity edges. For each predicate we prove tight bounds -- constructing graph pairs provably…
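As a concrete reading of the co-reference predicate described above, the following sketch computes it directly on a typed entity-attribute graph. It is a reference check written in plain Python, not an MPNN and not the paper's method; the function name `co_referent_pairs`, the parameter `min_shared`, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def co_referent_pairs(records, min_shared):
    """Return same-type entity pairs sharing at least `min_shared` attribute values.

    records: {entity_id: (entity_type, set_of_attribute_values)}
    """
    # Invert the bipartite graph: attribute value -> entities that carry it.
    by_attr = defaultdict(set)
    for entity_id, (_type, attrs) in records.items():
        for attr in attrs:
            by_attr[attr].add(entity_id)
    # Count shared attribute values for every same-type pair.
    shared = defaultdict(int)
    for entity_ids in by_attr.values():
        for u, v in combinations(sorted(entity_ids), 2):
            if records[u][0] == records[v][0]:
                shared[(u, v)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}

# Hypothetical master-data records.
records = {
    "c1": ("customer", {"alice@example.com", "555-0100"}),
    "c2": ("customer", {"alice@example.com", "555-0100", "1 Main St"}),
    "c3": ("customer", {"555-0100"}),
    "s1": ("supplier", {"555-0100"}),
}

print(co_referent_pairs(records, 1))
print(co_referent_pairs(records, 2))
```

With `min_shared=1` every customer pair matches through the shared phone number, while the supplier is excluded by the type check. With `min_shared=2` only `c1` and `c2` remain.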
