Bilingual Document Alignment with Latent Semantic Indexing

Ulrich Germann

arXiv:1707.09443 · cs.CL · August 1, 2017

TL;DR

This paper presents a bilingual document alignment method based on cross-lingual Latent Semantic Indexing. On English/French web pages it reaches a recall of about 88% when no in-domain data is used to build the latent semantic model, and 93% when such data is included.

Contribution

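The paper applies cross-lingual Latent Semantic Indexing to the WMT16 Bilingual Document Alignment Task. Cosine similarity between document vectors in a joint semantic space is combined with a string similarity measure over the corresponding URLs to produce 1:1 alignments of English/French web pages. The paper also argues that evaluating aligners by exact URL matches under-estimates their true performance.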

Findings

01 Achieves approximately 88% recall without in-domain data.

02 Improves to 93% recall when in-domain data is used.

03 Proposes a new evaluation method accounting for duplicates.

Abstract

We apply cross-lingual Latent Semantic Indexing to the Bilingual Document Alignment Task at WMT16. Reduced-rank singular value decomposition of a bilingual term-document matrix derived from known English/French page pairs in the training data allows us to map monolingual documents into a joint semantic space. Two variants of cosine similarity between the vectors that place each document into the joint semantic space are combined with a measure of string similarity between corresponding URLs to produce 1:1 alignments of English/French web pages in a variety of domains. The system achieves a recall of ca. 88% if no in-domain data is used for building the latent semantic model, and 93% if such data is included. Analysing the system's errors on the training data, we argue that evaluating aligner performance based on exact URL matches under-estimates their true performance and propose an…
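
The core idea in the abstract (reduced-rank SVD of a bilingual term-document matrix, then cosine similarity in the joint space) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy page pairs, the raw term counts without any weighting, the rank k = 3, the language-tagged vocabulary, and the helper names `tag`, `term_vector`, `fold_in`, and `cosine`. The sketch uses the standard LSI fold-in projection and plain cosine similarity only. It does not reproduce the two cosine variants, the URL similarity, or the 1:1 alignment step mentioned in the abstract.

```python
import numpy as np

# Hypothetical "known" English/French page pairs standing in for training data.
train_pairs = [
    ("cat sits mat", "chat assis tapis"),
    ("dog eats bone", "chien mange os"),
    ("cat eats fish", "chat mange poisson"),
    ("dog sits mat", "chien assis tapis"),
]

def tag(text, lang):
    """Prefix each token with its language so both vocabularies share one matrix."""
    return [f"{lang}:{w}" for w in text.split()]

vocab = sorted({t for en, fr in train_pairs for t in tag(en, "en") + tag(fr, "fr")})
index = {t: i for i, t in enumerate(vocab)}

def term_vector(terms):
    """Raw term counts over the joint vocabulary (assumption: no tf-idf weighting)."""
    v = np.zeros(len(vocab))
    for t in terms:
        if t in index:
            v[index[t]] += 1.0
    return v

# Bilingual term-document matrix: one column per known page pair,
# holding the terms of both the English and the French page.
A = np.column_stack(
    [term_vector(tag(en, "en") + tag(fr, "fr")) for en, fr in train_pairs]
)

# Reduced-rank SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3  # assumed rank for this toy example
U_k, s_k = U[:, :k], s[:k]

def fold_in(text, lang):
    """Map a monolingual document into the joint space: Sigma_k^{-1} U_k^T d."""
    return (U_k.T @ term_vector(tag(text, lang))) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

en_doc = fold_in("cat sits mat", "en")
sim_match = cosine(en_doc, fold_in("chat assis tapis", "fr"))
sim_other = cosine(en_doc, fold_in("chien mange os", "fr"))
print(f"matching pair: {sim_match:.3f}, non-matching pair: {sim_other:.3f}")
```

In this toy setup every English term co-occurs with its French counterpart in exactly the same training pairs, so a page and its translation are mapped to the same direction in the joint space and the matching pair scores higher than the non-matching one. Real web pages are far noisier, which is presumably why the system described in the abstract combines such scores with URL string similarity.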

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis