Linking Sequences of Events with Sparse or No Common Occurrence across Data Sets
Yunsung Kim

TL;DR
This paper introduces LDA-Link, a probabilistic model that links sequences of events across datasets even when they share few or no common events, enhancing privacy analysis and data integration.
Contribution
It formalizes the sequence linkage problem and proposes LDA-Link, a novel model that detects latent similarities without relying on shared events, unlike prior domain-specific methods.
Findings
LDA-Link outperforms existing solutions in linking sparse or no-overlap sequences.
The model effectively links social media profiles with no common posts.
LDA-Link demonstrates robustness across different data domains.
Abstract
Data of practical interest - such as personal records, transaction logs, and medical histories - are sequential collections of events relevant to a particular source entity. Recent studies have attempted to link sequences that represent a common entity across data sets to allow more comprehensive statistical analyses and to identify potential privacy failures. Yet, current approaches remain tailored to their specific domains of application, and they fail when co-referent sequences in different data sets contain sparse or no common events, which occurs frequently. To address this, we formalize the general problem of "sequence linkage" and describe "LDA-Link," a generic solution that is applicable even when co-referent event sequences contain no common items at all. LDA-Link is built upon the "Split-Document" model, a new mixed-membership probabilistic model for the generation…
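To make the idea of linking sequences with no common events concrete, here is a minimal sketch. It is not LDA-Link or the Split-Document model, whose details are not given in this summary. It assumes a hand-written, hypothetical event-to-topic map (EVENT_TOPIC) in place of topics learned by a model, and hypothetical helpers (topic_profile, cosine, link) that match each sequence to the sequence in the other data set with the closest topic mixture.

```python
from collections import Counter
from math import sqrt

# Hypothetical event-to-topic map. A real topic model would learn this
# from data; it is hand-written here purely for illustration.
EVENT_TOPIC = {
    "espresso": "food", "ramen": "food", "bakery": "food",
    "sushi": "food", "noodles": "food", "pizza": "food",
    "marathon": "sport", "gym": "sport", "cycling": "sport",
    "yoga": "sport", "tennis": "sport", "swim": "sport",
}
TOPICS = ("food", "sport")


def topic_profile(events):
    """Return the fraction of events that fall under each topic."""
    counts = Counter(EVENT_TOPIC[e] for e in events if e in EVENT_TOPIC)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TOPICS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link(dataset_a, dataset_b):
    """Map each sequence id in A to the id in B with the closest topic profile."""
    profiles_b = {k: topic_profile(v) for k, v in dataset_b.items()}
    links = {}
    for id_a, events in dataset_a.items():
        profile_a = topic_profile(events)
        links[id_a] = max(
            profiles_b, key=lambda id_b: cosine(profile_a, profiles_b[id_b])
        )
    return links


# Toy data: no event appears in both data sets.
dataset_a = {
    "a1": ["espresso", "ramen", "gym"],       # mostly food
    "a2": ["marathon", "cycling", "bakery"],  # mostly sport
}
dataset_b = {
    "b1": ["yoga", "tennis", "pizza"],        # mostly sport
    "b2": ["sushi", "noodles", "swim"],       # mostly food
}

print(link(dataset_a, dataset_b))
```

In this toy data the sequences are matched on topic mixture alone, since an exact-match comparison of events would find nothing in common. A greedy nearest-profile match is only one possible design choice; it does not enforce one-to-one links.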
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Topic Modeling
