Smart Bilingual Focused Crawling of Parallel Documents
Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

TL;DR
This paper introduces a neural-guided web crawling method that finds parallel documents more efficiently by fine-tuning a pre-trained multilingual language model to infer, from URLs alone, a document's language and whether a pair of URLs links to parallel documents.
Contribution
It presents a novel approach combining two fine-tuned models to improve the early discovery of parallel content during web crawling, reducing unnecessary downloads.
Findings
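Evaluated in isolation, both models are effective: one infers a document's language from its URL, and the other infers whether a pair of URLs links to parallel documents.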
The combined approach enhances early parallel content discovery.
Results show increased parallel document retrieval compared to unguided, brute-force crawling.
Abstract
Crawling parallel texts -- texts that are mutual translations -- from the Internet is usually done following a brute-force approach: documents are massively downloaded in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content more rapidly. We follow a neural approach that consists in adapting a pre-trained multilingual language model based on the encoder of the Transformer architecture by fine-tuning it for two new tasks: inferring the language of a document from its Uniform Resource Locator (URL), and inferring whether a pair of URLs link to parallel documents. We evaluate both models in isolation and their integration into a crawling tool. The results demonstrate the individual effectiveness of both models, and highlight that their combination…
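To make the idea of a guided crawl concrete, the following is a minimal sketch of a crawl frontier that ranks undiscovered links using two URL-only scores. It is not the authors' implementation: the paper fine-tunes a Transformer encoder for both tasks, whereas `lang_score` and `parallel_score` here are hypothetical toy heuristics standing in for those models. The `Frontier` class, the additive combination of the two scores, and the example URLs are likewise assumptions made for illustration.

```python
import heapq

def lang_score(url, target_langs):
    """Toy stand-in for the URL language model (assumption, not the paper's model).
    Returns 1.0 if a host label or path segment equals a target language code."""
    tokens = url.lower().replace("://", "/").replace(".", "/").split("/")
    return 1.0 if any(lang in tokens for lang in target_langs) else 0.0

def parallel_score(url_a, url_b, target_langs):
    """Toy stand-in for the URL-pair model (assumption, not the paper's model).
    Returns 1.0 if the two URLs differ only in a language-code path segment."""
    def mask(url):
        segments = url.lower().split("/")
        return "/".join("<lang>" if s in target_langs else s for s in segments)
    return 1.0 if url_a != url_b and mask(url_a) == mask(url_b) else 0.0

class Frontier:
    """Priority queue of URLs to download; higher-scoring URLs are popped first."""

    def __init__(self, target_langs):
        self.target_langs = tuple(target_langs)
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores are popped in insertion order

    def push(self, url, source_url):
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # Assumed combination rule: a plain sum of the two scores.
        score = (lang_score(url, self.target_langs)
                 + parallel_score(source_url, url, self.target_langs))
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

# Links found on a hypothetical English page, queued for a hypothetical en-es crawl.
frontier = Frontier(["en", "es"])
source = "https://example.com/en/about"
for link in ["https://example.com/contact",
             "https://example.com/es/about",
             "https://example.com/en/news"]:
    frontier.push(link, source)

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

In this toy run the Spanish counterpart of the source page is downloaded first, followed by the other page in a target language, and the language-neutral link last. A brute-force crawler would instead visit the links in discovery order.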
