Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language

Jaskaranjeet Singh; Rakesh Thakur

arXiv:2508.01918·cs.CL·October 6, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces PunGPT2, a comprehensive open-source Punjabi language model suite, and Quantum-RAG, a novel quantum-inspired retrieval method, significantly improving low-resource Punjabi NLP tasks with state-of-the-art results.

Contribution

It presents the first fully open-source Punjabi generative model, a retrieval-augmented framework, and a quantum-inspired retrieval method for low-resource language processing.

Findings

01

PunGPT2 outperforms multilingual models on Punjabi benchmarks.

02

Quantum-RAG improves retrieval recall and translation quality.

03

All resources and models are publicly released.

Abstract

Despite rapid advances in large language models (LLMs), low-resource languages remain excluded from NLP, limiting digital access for millions. We present PunGPT2, the first fully open-source Punjabi generative model suite, trained on a 35GB corpus covering literature, religious texts, news, social discourse, etc. PunGPT2 captures Punjabi's syntactic and morphological richness through a tokenizer optimized for Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmented framework integrating PunGPT2 with a FAISS retriever over a curated Punjabi knowledge base, and Pun-Instruct, an instruction-tuned variant using QLoRA for robust zero-shot summarization, translation, and question answering. Our key innovation, Quantum-RAG, fuses sparse, dense, and quantum kernel embeddings for efficient, context-aware retrieval with low memory overhead, marking the first practical…
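The abstract says Quantum-RAG fuses sparse, dense, and quantum kernel embeddings but, as quoted here, does not say how. The following is a minimal sketch of one way such a fusion could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the weighted-sum fusion, the min-max normalization, the default weights, and the choice of a fidelity kernel (the squared inner product of unit-norm vectors) as the "quantum-inspired" component.

```python
import numpy as np

def minmax(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-12)
    return d @ q

def fidelity_kernel_scores(query_vec, doc_matrix):
    """ASSUMPTION: quantum-inspired score taken as the fidelity |<q|d>|^2
    between amplitude-encoded (unit-norm) vectors, i.e. squared cosine."""
    return cosine_scores(query_vec, doc_matrix) ** 2

def fused_scores(q_sparse, d_sparse, q_dense, d_dense, weights=(0.3, 0.4, 0.3)):
    """ASSUMPTION: weighted sum of min-max normalized sparse, dense, and
    kernel scores; the weights are hypothetical, not taken from the paper."""
    w_sparse, w_dense, w_kernel = weights
    return (w_sparse * minmax(cosine_scores(q_sparse, d_sparse))
            + w_dense * minmax(cosine_scores(q_dense, d_dense))
            + w_kernel * minmax(fidelity_kernel_scores(q_dense, d_dense)))

def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return np.argsort(-np.asarray(scores))[:k].tolist()

# Toy data: 3 documents with term-count (sparse) and 2-d dense vectors.
d_sparse = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
d_dense = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
scores = fused_scores(np.array([1.0, 0.0, 0.0]), d_sparse,
                      np.array([1.0, 0.0]), d_dense)
ranking = top_k(scores, 2)
```

In a pipeline like the Pun-RAG setup the abstract describes, the dense scores would presumably come from a FAISS index rather than a brute-force matrix product; the brute-force version is used here only to keep the sketch self-contained.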

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- This work focuses on RAG for Punjabi, a low-resource language.
- This work introduces many resources for a low-resource language, including a 35GB dataset, a GPT-2-based LLM, an instruction-tuned model for Punjabi, and an evaluation benchmark.

Weaknesses

- Many figures and tables are flawed. For instance, Tables 1 and 2 are too wide, Figure 1 is too small to read, and Figure 3 is blurred.
- This work seems to apply many existing methods to a low-resource language, which may not be a good fit for ICLR.
- The paper is poorly written.
- The evaluation is very weak; I cannot find several experimental results, including downstream task evaluation and an ablation study.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper addresses a meaningful gap in NLP by developing resources for the underserved Punjabi language, with a commitment to open releases of datasets, models, and code that will benefit the research community. The work presents a comprehensive end-to-end pipeline spanning pretraining, retrieval-augmented generation, and instruction tuning, accompanied by thorough training details and systematic ablation studies. A key technical contribution is the hybrid retrieval approach that combines multi

Weaknesses

While the paper makes a timely and valuable push toward open Punjabi NLP with a coherent LM-RAG-instruction suite, several limitations remain that collectively weaken the empirical and methodological claims. First, mBERT — an encoder-only representation model — is compared to decoder LMs using ROUGE-L, a summarization/generation metric, which is not the right lens to assess mBERT’s contribution. Second, the retrieval methodology is under-specified and under-contextualized. The paper does not d

Reviewer 03 · Rating 0 · Confidence 5

Strengths

1. The authors seem to have put a lot of effort into trying to present good-quality human evals.

Weaknesses

1. The presentation, writing, and overall motives remain unclear. The authors start with presenting a contribution for training a model on Punjabi data, then move on to Quantum-RAG and then show results where Quantum-RAG performs the best. I'm unsure if a) this is a good fit for ICLR and b) this paper should be split into multiple different parts, with each part suitable for a different audience. For example, a technical report on pre-training/post-training a GPT-2 model for Punjabi and the eval

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining