Similarity Classification of Public Transit Stations

Hannah Bast; Patrick Brosi; Markus Näther

arXiv:2012.15267·cs.DB·January 1, 2021



TL;DR

This paper addresses the challenge of accurately determining whether two public transit station identifiers refer to the same station, proposing a machine learning approach that significantly outperforms baseline methods.

Contribution

The authors develop a random forest classifier utilizing trigrams, distance, and grid position features, achieving over 99% F1 score on large datasets, improving over traditional similarity measures.
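As a rough illustration of the three feature families named above (trigrams, distance, grid position), the following sketch derives them for one pair of station identifiers. The encoding is an assumption made for this example: the function names, the 0.1 degree grid size, and the way trigram overlap is summarized are hypothetical and are not taken from the paper, which may encode these features differently before passing them to the random forest.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000.0


def trigram_counts(label):
    """Count character trigrams of a lowercased, space-padded label."""
    padded = f"  {label.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; adequate for nearby stations."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)


def grid_cell(lat, lon, cell_deg=0.1):
    """Map a coordinate to a coarse grid cell (cell size is an assumption)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def pair_features(a, b):
    """Build a feature dict for two (label, lat, lon) station identifiers."""
    (label_a, lat_a, lon_a), (label_b, lat_b, lon_b) = a, b
    tri_a, tri_b = trigram_counts(label_a), trigram_counts(label_b)
    return {
        "distance_m": approx_distance_m(lat_a, lon_a, lat_b, lon_b),
        # Grid position of the pair's midpoint, so a classifier could
        # pick up regional naming conventions.
        "grid_cell": grid_cell((lat_a + lat_b) / 2, (lon_a + lon_b) / 2),
        "shared_trigrams": sum((tri_a & tri_b).values()),
        "trigrams_only_a": sum((tri_a - tri_b).values()),
        "trigrams_only_b": sum((tri_b - tri_a).values()),
    }


features = pair_features(
    ("St Pancras International", 51.5306, -0.1253),
    ("London St Pancras", 51.5319, -0.1269),
)
```

A feature dict like this could be flattened into a numeric vector and used to train any off-the-shelf random forest implementation on labeled pairs.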

Findings

01 Learning-based approach achieves over 99% F1 score.

02 Baseline methods reach at most 94% F1 score.

03 Naive distance threshold method scores only 75% F1.

Abstract

We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for example in geographic information systems, schedule merging, route planning, or map matching. We consider several baseline methods based on geographic distance and simple string similarity measures. We also experiment with more elaborate string similarity measures and manually created normalization rules. Our experiments show that these baseline methods produce good, but not fully satisfactory results. We therefore develop an approach based on a random forest classifier which is trained on…
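The two kinds of baseline mentioned in the abstract, geographic distance and simple string similarity, can be sketched as follows for the St Pancras example. This is a minimal sketch under stated assumptions: trigram Jaccard similarity stands in for "simple string similarity", and the thresholds `max_m=250.0` and `min_sim=0.5` are illustrative values chosen for this example, not values reported by the authors.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def trigram_set(label):
    """Set of character trigrams of a lowercased, space-padded label."""
    padded = f"  {label.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def label_similarity(label_a, label_b):
    """Jaccard similarity of the two labels' trigram sets, in [0, 1]."""
    a, b = trigram_set(label_a), trigram_set(label_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def same_by_distance(a, b, max_m=250.0):
    """Distance-only baseline; the threshold is a hypothetical value."""
    return haversine_m(a[1], a[2], b[1], b[2]) <= max_m


def same_by_label(a, b, min_sim=0.5):
    """String-only baseline; the threshold is a hypothetical value."""
    return label_similarity(a[0], b[0]) >= min_sim


station_a = ("St Pancras International", 51.5306, -0.1253)
station_b = ("London St Pancras", 51.5319, -0.1269)

print(f"distance: {haversine_m(*station_a[1:], *station_b[1:]):.0f} m")
print(f"trigram Jaccard: {label_similarity(station_a[0], station_b[0]):.2f}")
print("same by distance:", same_by_distance(station_a, station_b))
```

Each baseline looks at only one signal, which suggests why neither is fully satisfactory on its own: nearby but distinct stations defeat a pure distance threshold, and differently worded labels for the same station defeat a pure string threshold.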

Code & Models

Repositories

ad-freiburg/statsimi-eval (official)

Taxonomy

Topics: Data Management and Algorithms · Data Quality and Management · Geographic Information Systems Studies