Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Jaedong Hwang; Kumar Tanmay; Seok-Jin Lee; Ayush Agrawal; Hamid Palangi; Kumar Ayush; Ila Fiete; Paul Pu Liang

arXiv:2507.05418·cs.CL·September 29, 2025

TL;DR

This paper introduces M2A, a method that improves multilingual reasoning in large language models by aligning languages at multiple scales and using language consistency rewards, especially benefiting low-resource languages.

Contribution

The paper presents M2A, a novel training approach that enhances reasoning accuracy across languages and introduces GeoFact-X, a benchmark for multilingual factual reasoning with reasoning traces.

Findings

01. M2A significantly improves reasoning fidelity in multiple languages.

02. GeoFact-X reveals that reasoning in the target language improves model performance.

03. Multilingual reinforcement learning is key for robust cross-lingual reasoning.

Abstract

Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual question answering, and code generation, yet their ability to reason on these tasks in different languages remains underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. We propose M2A, a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions, training models to reason directly and accurately in the target language. Furthermore, existing multilingual benchmarks only evaluate on final answers, overlooking whether reasoning occurs in the intended language. To close this gap, we introduce GeoFact-X, a geography-based…
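The abstract mentions language-consistency rewards but this excerpt does not define them. As a rough illustration of the idea only, the sketch below scores a reasoning trace by the fraction of its letters that belong to the target language's script. The function name `language_consistency_reward`, the `SCRIPT_PREFIXES` table, and the script-based heuristic are all assumptions made for this example and are not taken from the paper.

```python
import unicodedata

# Hypothetical mapping from language code to Unicode character-name prefixes.
# This table is an assumption for illustration, not the paper's configuration.
SCRIPT_PREFIXES = {
    "en": ("LATIN",),
    "sw": ("LATIN",),
    "th": ("THAI",),
    "hi": ("DEVANAGARI",),
    "ja": ("HIRAGANA", "KATAKANA", "CJK"),
}


def language_consistency_reward(text: str, target_lang: str) -> float:
    """Return the fraction of alphabetic characters in the target script.

    A value of 1.0 means every letter is in the expected script; 0.0 means
    none is (or the text contains no letters at all).
    """
    prefixes = SCRIPT_PREFIXES[target_lang]
    # Keep letters only; digits, punctuation, and combining marks are skipped.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(prefixes)
    )
    return in_script / len(letters)
```

A script check like this cannot separate languages that share a script (for example English and Swahili), so a practical reward would more likely rely on a language-identification model; the heuristic is used here only to keep the example self-contained.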

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper addresses a critical issue of interpretability in low-resource languages caused by a lack of language adherence in LLM reasoning traces. Models often reason in English instead of the target language of the question. The paper introduces a reward modeling approach that combines rewards for language consistency, context and reasoning alignment. The paper also introduces a high-quality dataset, GEOFACT-X, a geography-based multilingual factual reasoning benchmark, in five languages: English, H

Weaknesses

1. While M2A shows improvements on Math or language accuracy over SFT and GRPO individually, it doesn't seem to improve MGSM Math or Language accuracy scores over the baseline Qwen-2.5-Instruct model in Table 2.
2. The proposed M2A approach requires ground truth outputs as well as ground truth reasoning traces for Multilingual Context Alignment and Multilingual Reasoning-Step Alignment as described in Section 3. This is a major limitation especially for resource constrained languages. One of th

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The paper addresses a relevant problem, namely, ensuring reasoning in LLMs in languages beyond English and Chinese.
2. The paper introduces both a novel method (M2A) and a new LLM-generated dataset (GeoFact-X). GeoFact-X was validated by human annotators, which could potentially be of value to the research community.
3. The paper is generally written clearly, with Figures 1-4 providing visualizations to explain the methodology further.
4. The paper includes ablation studies of the reward funct

Weaknesses

1. The motivation of the work is flawed. The authors argue that reasoning traces in target languages are essential for trustworthiness (line 34) and to "ensure interpretability, i.e. users can directly follow the reasoning in their own language" (line 36-37). At the same time, it has been shown that reasoning steps are not necessarily faithful to the final answer [1, 2, inter alia], and therefore cannot be trusted to provide faithful interpretability for users.
2. The cultural relevance of the d

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Introduced a novel M2A method for enforcing language-consistent reasoning in LLMs.
- Presented the GEOFACT-X benchmark to evaluate multilingual reasoning traces.
- Highlighted clear empirical gains in both reasoning fidelity and mathematical accuracy.
- Proposed a joint accuracy metric combining reasoning correctness and language fidelity.
- Explained reasoning and alignment rewards with well-designed figures and concise text.

Weaknesses

- Results rely on a single model family (Qwen-2.5-7B), leaving it unclear how the proposed M2A training generalizes to other architectures.
- The LLM-as-judge setup risks stylistic bias, as the evaluator (Qwen2.5-72B) may implicitly favor Gemini-like phrasing and structure.
- Reliance on Google Translate may introduce subtle cultural or contextual distortions.

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks

Methods: ALIGN · Focus