LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering
Aarya Khandelwal, Ritwik Mishra, Rajiv Ratn Shah

TL;DR
LittiChoQA introduces the largest literary question answering dataset for Indic languages, enabling evaluation of multilingual models on long-context literary QA with insights into performance and efficiency trade-offs.
Contribution
The paper presents LittiChoQA, a large-scale, multilingual literary QA dataset for Indic languages, and evaluates multiple models on long-context QA tasks.
Findings
Full-context fine-tuning improves semantic scores.
Context shortening increases throughput.
Krutrim-2 outperforms other models.
Abstract
Long-context question answering (QA) over literary texts poses significant challenges for modern large language models, particularly in low-resource languages. We address the scarcity of long-context QA resources for Indic languages by introducing LittiChoQA, the largest literary QA dataset to date covering many languages spoken in the Gangetic plains of India. The dataset comprises over 270K automatically generated question-answer pairs with a balanced distribution of factoid and non-factoid questions, generated from naturally authored literary texts collected from the open web. We evaluate multiple multilingual LLMs on non-factoid, abstractive QA, under both full-context and context-shortened settings. Results demonstrate a clear trade-off between performance and efficiency: full-context fine-tuning yields the highest token-level and semantic-level scores, while context shortening…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
