Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents

Manoj Balaji Jagadeeshan; Prince Raj; Pawan Goyal

arXiv:2505.19494 · cs.CL · May 27, 2025

TL;DR

This paper introduces Anveshana, a benchmark dataset and evaluation framework for cross-lingual information retrieval between English queries and Sanskrit documents, utilizing advanced models and translation techniques to improve access to ancient texts.

Contribution

It provides a new publicly available dataset and evaluates multiple retrieval methods, including translation-based approaches, for Sanskrit-English cross-lingual retrieval.

Findings

01. DT methods outperform DR and QT in retrieval accuracy

02. Fine-tuned models effectively handle Sanskrit's linguistic features

03. The dataset enables further research in Sanskrit information retrieval

Abstract

The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures…
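
The difference between the QT and DT pipelines named in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes DT means translating the Sanskrit documents into English before retrieval, uses a plain Okapi BM25 scorer, and replaces the machine-translation step with a toy word-for-word lexicon. The names `SA_TO_EN`, `translate`, `bm25_scores`, and `rank`, and the three transliterated documents, are all hypothetical. DR is omitted because it needs a multilingual encoder.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query: Sequence[str], docs: Sequence[Sequence[str]]) -> List[int]:
    """Return document indices ordered from best to worst BM25 score."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy stand-in for a translation model (assumption, not a real MT system).
SA_TO_EN: Dict[str, str] = {
    "raja": "king", "vana": "forest", "gacchati": "goes",
    "nadi": "river", "vahati": "flows", "pashyati": "sees",
}
EN_TO_SA: Dict[str, str] = {en: sa for sa, en in SA_TO_EN.items()}


def translate(tokens: Sequence[str], lexicon: Dict[str, str]) -> List[str]:
    """Word-for-word lookup; unknown tokens pass through unchanged."""
    return [lexicon.get(t, t) for t in tokens]


# Hypothetical transliterated Sanskrit documents, already tokenized.
sanskrit_docs = [
    ["raja", "vana", "gacchati"],
    ["nadi", "vahati"],
    ["raja", "nadi", "pashyati"],
]
english_query = ["king", "forest"]

# QT: translate the query into Sanskrit, then search the original documents.
qt_ranking = rank(translate(english_query, EN_TO_SA), sanskrit_docs)

# DT: translate the documents into English, then search with the original query.
english_docs = [translate(d, SA_TO_EN) for d in sanskrit_docs]
dt_ranking = rank(english_query, english_docs)

print("QT ranking:", qt_ranking)
print("DT ranking:", dt_ranking)
```

With a lossless toy lexicon the two rankings coincide; with real translation systems the two pipelines can diverge, which is the kind of difference the benchmark is reported to measure.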

Code & Models

Datasets

manojbalaji1/anveshana
dataset · 152 dl

Taxonomy

Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · WordPiece · Weight Decay · Multi-Head Attention