Exploiting Redundancy, Recurrence and Parallelism: How to Link Millions of Addresses with Ten Lines of Code in Ten Minutes
Yuhang Zhang, Tania Churchill, Kee Siong Ng

TL;DR
This paper introduces a scalable, robust address linking algorithm that avoids complex standardisation, enabling rapid and accurate data fusion across large government datasets with minimal code and time.
Contribution
The paper presents a novel, simple, and efficient address linking method that handles data quality issues without standardisation, suitable for large-scale government data integration.
Findings
Successfully linked large address datasets with high accuracy
Linked millions of addresses in about ten minutes using roughly ten lines of code
Demonstrated robustness to data quality issues
Abstract
Accurate and efficient record linkage is an open challenge of particular relevance to Australian Government Agencies, who recognise that so-called wicked social problems are best tackled by forming partnerships founded on large-scale data fusion. Names and addresses are the most common attributes on which data from different government agencies can be linked. In this paper, we focus on the problem of address linking. Linkage is particularly problematic when the data has significant quality issues. The most common approach for dealing with quality issues is to standardise raw data prior to linking. If a mistake is made in standardisation, however, it is usually impossible to recover from it to perform linkage correctly. This paper proposes a novel algorithm for address linking that is particularly practical for linking large disparate sets of addresses, being highly scalable, robust to…
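To illustrate what linking without prior standardisation can look like, here is a minimal Python sketch. It is not the paper's algorithm, which the truncated abstract does not describe. The function names `tokens` and `link`, the inverse-frequency token weighting, and the sample addresses are all assumptions made for illustration. The idea shown is only that raw addresses carry redundant tokens (unit number, street number, suburb, postcode), so two records can be matched on shared tokens even when abbreviations such as "ST" versus "Street" disagree.

```python
import math
import re
from collections import defaultdict


def tokens(address):
    # Lowercase and split on non-alphanumerics. No field parsing and no
    # abbreviation expansion: the raw tokens are used as they are.
    return set(re.findall(r"[a-z0-9]+", address.lower()))


def link(left, right):
    """For each address in `left`, return the index of the best-scoring
    address in `right`, or None if no token is shared. Illustrative only."""
    # Inverted index: token -> set of indices of `right` addresses containing it.
    index = defaultdict(set)
    for i, address in enumerate(right):
        for t in tokens(address):
            index[t].add(i)

    # Assumed weighting: tokens that occur in fewer addresses count for more.
    n = len(right)
    weight = {t: math.log(1 + n / len(ids)) for t, ids in index.items()}

    result = []
    for address in left:
        scores = defaultdict(float)
        for t in tokens(address):
            for i in index.get(t, ()):
                scores[i] += weight[t]
        result.append(max(scores, key=scores.get) if scores else None)
    return result


# Hypothetical sample data, not taken from the paper.
right = [
    "Unit 3, 12 Smith Street, Canberra ACT 2600",
    "12 Smith Road, Sydney NSW 2000",
    "45 Jones Avenue, Melbourne VIC 3000",
]
left = ["3/12 SMITH ST CANBERRA 2600", "45 jones ave melbourne", "zzz"]

print(link(left, right))  # [0, 2, None]
```

The first query matches despite "ST" and "Street" never being reconciled, because the unit number, street number, suburb, and postcode still agree. A standardisation step that wrongly expanded "ST" would have had no way to recover, which is the failure mode the abstract describes.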
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Quantum-Dot Cellular Automata · SARS-CoV-2 detection and testing
