ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson, Tangrui Li, Pétur Húni Björnsson, Eiríkur Smári Sigurðarson, Jilles S. Dibangoye

TL;DR
ICE-ID is a historical census dataset from Iceland spanning 220 years, built to advance longitudinal identity resolution. It poses challenges absent from standard benchmarks, such as hierarchical geography and patronymic naming conventions.
Contribution
The paper introduces ICE-ID, a novel, large-scale dataset for long-term identity resolution, including detailed analysis and tools for benchmarking and research.
Findings
Dataset covers 220 years of Icelandic census data (984,028 records from 16 census waves, 1703–1920).
Includes hierarchical geography and kinship links.
Provides baseline models and analysis artifacts.
Abstract
We introduce ICE-ID, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703–1920), with 226,864 expert-curated person identifiers. ICE-ID combines hierarchical geography (farm → parish → district → county), patronymic naming conventions, sparse kinship links (partner, father, mother), and multi-decadal temporal drift, challenges not captured by standard product-matching or citation datasets. This paper presents an artifact-backed analysis of temporal coverage, missingness, identifier ambiguity, candidate-generation efficiency, and cluster distributions, and situates ICE-ID against classical ER benchmarks (Abt–Buy, Amazon–Google, DBLP–ACM, DBLP–Scholar, Walmart–Amazon, iTunes–Amazon, Beer, Fodors–Zagats). We also define a deployment-faithful temporal OOD protocol and release the dataset, splits, regeneration…
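The abstract does not spell out the temporal OOD protocol, so the following is only a minimal sketch of one common reading: train on earlier census waves and evaluate on strictly later ones. The function name `temporal_split`, the field name `census_year`, the cutoff, and the toy records are all assumptions made for illustration, not part of the released dataset or splits.

```python
from typing import Dict, List, Tuple

Record = Dict[str, int]


def temporal_split(
    records: List[Record], cutoff_year: int
) -> Tuple[List[Record], List[Record]]:
    """Split records into (train, test) by census year.

    Records from waves up to and including cutoff_year go to train;
    records from later waves go to test, so the model is evaluated
    only on waves it has never seen (a temporal holdout).
    """
    train = [r for r in records if r["census_year"] <= cutoff_year]
    test = [r for r in records if r["census_year"] > cutoff_year]
    return train, test


# Toy records; "census_year" is a hypothetical field name and the
# intermediate years are illustrative values inside the 1703-1920 span.
records = [
    {"id": 1, "census_year": 1703},
    {"id": 2, "census_year": 1800},
    {"id": 3, "census_year": 1900},
    {"id": 4, "census_year": 1920},
]

train, test = temporal_split(records, cutoff_year=1850)
print(len(train), len(test))
```

Splitting on the wave year rather than at random keeps later records out of training, which is the property a deployment-style evaluation typically needs.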
Taxonomy
Topics: Human Mobility and Location-Based Analysis
