LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering

Aarya Khandelwal; Ritwik Mishra; Rajiv Ratn Shah

arXiv:2601.03025·cs.CL·January 7, 2026

Open Access

TL;DR

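LittiChoQA is a literary question answering dataset of over 270K automatically generated question-answer pairs, covering many Indic languages spoken in the Gangetic plains and described by the authors as the largest of its kind to date. It is used to evaluate multilingual LLMs on long-context literary QA, where the results show a trade-off between performance and efficiency.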

Contribution

The paper presents LittiChoQA, a large-scale, multilingual literary QA dataset for Indic languages, and evaluates multiple models on long-context QA tasks.

Findings

01

Full-context fine-tuning yields the highest token-level and semantic-level scores.

02

Context shortening increases throughput.

03

Krutrim-2 outperforms other models.

Abstract

Long-context question answering (QA) over literary texts poses significant challenges for modern large language models, particularly in low-resource languages. We address the scarcity of long-context QA resources for Indic languages by introducing LittiChoQA, the largest literary QA dataset to date covering many languages spoken in the Gangetic plains of India. The dataset comprises over 270K automatically generated question-answer pairs with a balanced distribution of factoid and non-factoid questions, generated from naturally authored literary texts collected from the open web. We evaluate multiple multilingual LLMs on non-factoid, abstractive QA, under both full-context and context-shortened settings. Results demonstrate a clear trade-off between performance and efficiency: full-context fine-tuning yields the highest token-level and semantic-level scores, while context shortening…
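The abstract refers to token-level scores without naming the metric in the text shown here. As a rough illustration only, the sketch below computes token-overlap F1, one common token-level measure for abstractive QA; the function name `token_f1` and the whitespace tokenization are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: tokens are whitespace-separated, which works for
    space-delimited scripts such as Devanagari. The paper may use a
    different tokenizer or a different token-level metric.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score like this rewards exact word overlap, which is why semantic-level scores are usually reported alongside it for non-factoid answers that can be phrased in many ways.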

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications