Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction
Ritesh Sunil Chavan, Jack Mostow

TL;DR
This study evaluates large language models' cross-lingual comprehension using a Next Sentence Prediction benchmark across English, Swahili, and Hausa, revealing resource-dependent performance and nuanced effects of Chain-of-Thought prompting.
Contribution
It introduces a large-scale cross-lingual NSP benchmark and analyzes how different models and prompting techniques perform across languages with differing levels of resources.
Findings
Models excel in English but struggle in low-resource languages.
Chain-of-Thought prompting improves performance for weaker models like LLaMA 3.
For stronger models, Chain-of-Thought can sometimes decrease accuracy.
Abstract
While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work by Agarwal et al. (2025), which used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling…
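The per-language comparison described in the abstract comes down to computing accuracy separately for each language. The sketch below is only an illustration of that bookkeeping, not the authors' evaluation code: the record layout `(language, predicted_index, gold_index)`, the function name `accuracy_by_language`, and the toy records are all assumptions made for this example, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for an iterable of
    (language, predicted_index, gold_index) tuples.

    The tuple layout is a hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, gold in records:
        total[language] += 1
        if predicted == gold:
            correct[language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy records invented purely to exercise the function;
# they are NOT data or results from the paper.
toy_records = [
    ("English", 0, 0),
    ("English", 2, 2),
    ("Swahili", 1, 1),
    ("Swahili", 3, 0),
    ("Hausa", 2, 1),
    ("Hausa", 0, 0),
]
scores = accuracy_by_language(toy_records)
```

Keeping the tallies keyed by language makes it straightforward to run the same comparison under different prompting conditions (for example, with and without Chain-of-Thought) by calling the function once per condition.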
