Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

TL;DR
This paper extends process reward modeling to multilingual settings. Process reward models trained on seven languages improve average reasoning accuracy and reduce early-stage reasoning errors in evaluations across 11 languages.
Contribution
The paper introduces multilingual process reward models (PRMs) trained on a seven-language dataset translated from English, addressing the challenge of multilingual multi-step reasoning in LLMs.
Findings
Multilingual PRMs improve average accuracy across 11 languages.
Multilingual PRMs reduce early-stage reasoning errors.
Performance is sensitive to training languages and data volume.
Abstract
Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of…
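The step-level feedback described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the authors' implementation: the `StepScorer` interface, the `toy_scorer` stand-in for a trained PRM, the 0.5 threshold, and the choice to rank candidate solutions by their lowest step score are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: given a question and the reasoning steps so far,
# return the estimated probability that the latest step is correct.
StepScorer = Callable[[str, Sequence[str]], float]


def score_steps(question: str, steps: Sequence[str], scorer: StepScorer) -> List[float]:
    """Score every step of one solution, conditioning on the steps before it."""
    return [scorer(question, steps[: i + 1]) for i in range(len(steps))]


def first_error_step(step_scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below the (assumed) threshold, or None."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def select_best(question: str, candidates: Sequence[Sequence[str]], scorer: StepScorer) -> int:
    """Pick the candidate whose weakest step scores highest (assumed min-aggregation)."""
    def solution_score(steps: Sequence[str]) -> float:
        return min(score_steps(question, steps, scorer), default=0.0)

    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))


def toy_scorer(question: str, prefix: Sequence[str]) -> float:
    """Stand-in for a trained PRM: penalizes one hard-coded arithmetic mistake."""
    return 0.1 if "2 + 2 = 5" in prefix[-1] else 0.9


if __name__ == "__main__":
    question = "What is 2 + 2?"
    candidates = [
        ["2 + 2 = 5", "So the answer is 5."],
        ["2 + 2 = 4", "So the answer is 4."],
    ]
    for steps in candidates:
        scores = score_steps(question, steps, toy_scorer)
        print(scores, first_error_step(scores))
    print("selected candidate:", select_best(question, candidates, toy_scorer))
```

In this sketch, `first_error_step` gives one possible way to locate where a solution first goes wrong, which is the kind of quantity that a claim about early-stage errors refers to. A real PRM would replace `toy_scorer` with a learned model.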
Taxonomy
Topics: Business Process Modeling and Analysis
