Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Hao Zhang, Jae Ro, Richard Sproat

TL;DR
This paper presents a semi-supervised approach to URL segmentation: a character-level RNN tagger pre-trained on concatenated entity names from a knowledge graph. Pre-training improves the model by 33% and brings sequence accuracy to 85%.
Contribution
The paper introduces a pre-training method that uses concatenated entity names from a large knowledge graph to compensate for the lack of labeled training data for RNN-based URL segmentation.
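As a rough illustration of how concatenated entity names can serve as pre-training data, the sketch below turns a multi-word entity name into an unsegmented string plus per-character boundary tags. The function name `entity_to_example`, the B/I tag scheme, and the lowercasing and punctuation handling are assumptions made for this example; the paper's summary does not specify them.

```python
import re


def entity_to_example(entity_name):
    """Build one synthetic training example from a multi-word entity name.

    Assumed scheme (not taken from the paper): "B" marks the first
    character of a word, "I" marks every other character.
    """
    # Keep only lowercase alphanumeric runs, mimicking domain-name text.
    words = re.findall(r"[a-z0-9]+", entity_name.lower())
    chars, tags = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if i == 0 else "I")
    return "".join(chars), tags


text, tags = entity_to_example("Open Research")
print(text)           # openresearch
print("".join(tags))  # BIIIBIIIIIII
```

Because the word boundaries in entity names are known, each name yields a labeled example at no annotation cost, which is what makes this kind of data usable for pre-training.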
Findings
Pre-training improves the model by 33%.
Sequence accuracy reaches 85% with pre-training.
The model segments domain names such as openresearch into component words (open and research).
Abstract
Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
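To make the tagging formulation concrete, the sketch below decodes per-character tags back into words and scores predictions by exact match over whole sequences. The B/I tag scheme, the helper names `tags_to_words` and `sequence_accuracy`, and the reading of "sequence accuracy" as the fraction of names segmented entirely correctly are assumptions for illustration, not details confirmed by the abstract.

```python
def tags_to_words(text, tags):
    """Decode per-character tags into words.

    Assumed scheme: "B" starts a new word, "I" continues the current one.
    """
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


def sequence_accuracy(predicted, gold):
    """Fraction of examples whose whole segmentation matches the reference."""
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


print(tags_to_words("openresearch", list("BIIIBIIIIIII")))  # ['open', 'research']
```

Under this metric a single misplaced boundary makes the whole domain name count as wrong, so it is stricter than per-character tag accuracy.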
