Assessing Wikipedia-Based Cross-Language Retrieval Models
Benjamin Roth

TL;DR
This paper evaluates and combines various concept models like pLSA, LDA, and ESA for cross-language retrieval using Wikipedia, demonstrating that combined models and language modeling approaches improve retrieval performance without relying on parallel corpora.
Contribution
It introduces improved Wikipedia-based cross-language retrieval models by adapting pLSA and LDA, and explores effective model combination and language modeling techniques.
Findings
For multilingual pLSA, a weighting scheme that favors documents of similar length on both language sides gives the best results.
Combining machine translation with concept models increases performance by 21.1%.
Language modeling with Wikipedia links achieves strong results without parallel corpora.
Abstract
This work compares concept models for cross-language retrieval: First, we adapt probabilistic Latent Semantic Analysis (pLSA) for multilingual documents. Experiments with different weighting schemes show that a weighting method favoring documents of similar length in both language sides gives best results. Considering that both monolingual and multilingual Latent Dirichlet Allocation (LDA) behave alike when applied for such documents, we use a training corpus built on Wikipedia where all documents are length-normalized and obtain improvements over previously reported scores for LDA. Another focus of our work is on model combination. For this end we include Explicit Semantic Analysis (ESA) in the experiments. We observe that ESA is not competitive with LDA in a query based retrieval task on CLEF 2000 data. The combination of machine translation with concept models increased performance…
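To make the idea of a concept model for cross-language retrieval concrete, here is a minimal sketch in the spirit of ESA: a text is mapped to a vector with one dimension per Wikipedia concept, and because each concept has an article in both languages, a query and a document in different languages can be compared in the same concept space. The toy `CONCEPTS` data, the raw term-overlap weighting (in place of TF-IDF), and the function names `esa_vector`, `cosine`, and `rank` are all assumptions made for illustration; this is not the paper's implementation or experimental setup.

```python
import math
from collections import Counter

# Hypothetical toy "Wikipedia": each concept has one English and one German article.
CONCEPTS = {
    "Car": {"en": "car engine road drive", "de": "auto motor strasse fahren"},
    "Dog": {"en": "dog animal pet bark", "de": "hund tier haustier bellen"},
    "Music": {"en": "music song guitar band", "de": "musik lied gitarre band"},
}


def esa_vector(text, lang):
    """Map a text to a vector over concepts (sorted by name).

    Each entry is the term overlap between the text and that concept's
    article in the given language. Real ESA would use TF-IDF weights;
    raw counts are a simplifying assumption here.
    """
    words = Counter(text.lower().split())
    vector = []
    for name in sorted(CONCEPTS):
        article = Counter(CONCEPTS[name][lang].split())
        vector.append(sum(count * article[word] for word, count in words.items()))
    return vector


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, query_lang, docs, doc_lang):
    """Rank documents by concept-space similarity to a query in another language."""
    query_vec = esa_vector(query, query_lang)
    scored = [(cosine(query_vec, esa_vector(doc, doc_lang)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    german_docs = ["der hund bellen laut", "das auto fahren schnell"]
    for score, doc in rank("dog bark", "en", german_docs, "de"):
        print(f"{score:.2f}  {doc}")
```

The English query and the German documents share no vocabulary, yet the dog-related document ranks first because both map onto the same concept dimension. Latent models such as pLSA and LDA play an analogous role, with learned topics in place of explicit Wikipedia articles.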
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Latent Dirichlet Allocation
