Similarity Classification of Public Transit Stations
Hannah Bast, Patrick Brosi, Markus Näther

TL;DR
This paper addresses the problem of deciding whether two public transit station identifiers refer to the same station. It proposes a machine learning approach that reaches over 99% F1, compared with at most 94% for the baseline methods.
Contribution
The authors develop a random forest classifier that uses trigram, distance, and grid position features. It achieves an F1 score of over 99% on large datasets, improving on traditional similarity measures.
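To make the three feature groups concrete, here is a minimal sketch of how a feature vector for one station pair might be assembled. The exact encoding used by the authors is not given on this page, so everything here is an assumption for illustration: the helper names (`trigram_counts`, `approx_dist_m`, `grid_cell`, `pair_features`), the tiny trigram vocabulary, the per-trigram count difference, and the 1-degree grid cell size are all hypothetical.

```python
import math
from collections import Counter

def trigram_counts(label):
    """Count character trigrams of a lowercased label."""
    s = label.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def approx_dist_m(lat1, lon1, lat2, lon2):
    """Equirectangular distance approximation in meters (fine for nearby points)."""
    r = 6371000.0  # mean Earth radius in meters
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return r * math.hypot(x, y)

def grid_cell(lat, lon, cell_deg=1.0):
    """Index of the coarse grid cell containing a point (cell size is an assumption)."""
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)

def pair_features(a, b, vocab, cell_deg=1.0):
    """Build one feature vector for a pair of (label, lat, lon) identifiers.

    Assumed layout: one trigram count difference per vocabulary entry,
    then the distance in meters, then the grid cell of the pair's midpoint.
    """
    (label_a, lat_a, lon_a), (label_b, lat_b, lon_b) = a, b
    ca, cb = trigram_counts(label_a), trigram_counts(label_b)
    diffs = [ca[t] - cb[t] for t in vocab]
    gx, gy = grid_cell((lat_a + lat_b) / 2, (lon_a + lon_b) / 2, cell_deg)
    return diffs + [approx_dist_m(lat_a, lon_a, lat_b, lon_b), gx, gy]

# Hypothetical four-trigram vocabulary; a real one would be derived from training data.
vocab = ["st ", "pan", "lon", "int"]
a = ("St Pancras International", 51.5306, -0.1253)
b = ("London St Pancras", 51.5319, -0.1269)
print(pair_features(a, b, vocab))
```

Vectors of this shape, labeled as "same station" or "different station", could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`. The page does not say which library the authors used.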
Findings
Learning-based approach achieves over 99% F1 score.
Baseline methods reach at most 94% F1 score.
Naive distance threshold method scores only 75% F1.
Abstract
We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for example in geographic information systems, schedule merging, route planning, or map matching. We consider several baseline methods based on geographic distance and simple string similarity measures. We also experiment with more elaborate string similarity measures and manually created normalization rules. Our experiments show that these baseline methods produce good, but not fully satisfactory results. We therefore develop an approach based on a random forest classifier which is trained on…
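The two kinds of baseline mentioned in the abstract, geographic distance and simple string similarity, can be sketched on the St Pancras example as follows. This is an illustrative sketch and not the authors' implementation: trigram Jaccard similarity is only one possible "simple string similarity measure", and the thresholds `max_dist_m` and `min_sim` are hypothetical values chosen for the example.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def trigrams(label):
    """Set of character trigrams of a lowercased label."""
    s = label.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_jaccard(label_a, label_b):
    """Jaccard similarity of the two labels' trigram sets, in [0, 1]."""
    ta, tb = trigrams(label_a), trigrams(label_b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def same_by_distance(a, b, max_dist_m=250.0):
    """Distance baseline: 'same' if the points are within a (hypothetical) threshold."""
    return haversine_m(a[1], a[2], b[1], b[2]) <= max_dist_m

def same_by_label(a, b, min_sim=0.5):
    """String baseline: 'same' if label similarity reaches a (hypothetical) threshold."""
    return trigram_jaccard(a[0], b[0]) >= min_sim

# Identifiers are (label, latitude, longitude) tuples, taken from the abstract's example.
a = ("St Pancras International", 51.5306, -0.1253)
b = ("London St Pancras", 51.5319, -0.1269)
print(round(haversine_m(a[1], a[2], b[1], b[2])), "m apart")
print(round(trigram_jaccard(a[0], b[0]), 3), "trigram Jaccard")
print(same_by_distance(a, b), same_by_label(a, b))
```

With these illustrative thresholds, the distance check accepts the pair (the two points are roughly 180 m apart), while the label check rejects it because the labels share fewer than half of their trigrams. Fixed thresholds on a single signal are brittle in this way, which is consistent with the abstract's remark that the baselines are good but not fully satisfactory.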
Taxonomy
Topics: Data Management and Algorithms · Data Quality and Management · Geographic Information Systems Studies
