Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul; Goran Glavaš; Iryna Gurevych

arXiv:2605.00754 · cs.SE · May 11, 2026

TL;DR

This paper introduces Themis, a suite of multilingual code reward models trained for flexible multi-criteria scoring, supported by a new benchmark and a large open-source preference dataset.

Contribution

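The paper contributes Themis-CodeRewardBench, a benchmark spanning five preference criteria and eight programming languages; Themis-CodePreference, a large open-source preference dataset; and the Themis suite of multilingual reward models for multi-criteria code scoring.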

Findings

01. Reward models show positive scaling with size.

02. Cross-lingual transfer improves with diverse preferences.

03. Multi-criteria training enhances code reward model reliability.

Abstract

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source…
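As an illustration of how a preference benchmark of this kind is commonly scored, the sketch below computes pairwise preference accuracy grouped by criterion and programming language. It assumes each benchmark item pairs a preferred response with a rejected one and that the RM returns a scalar score. The record fields, the `score_fn` callable, the `readability` criterion label, and the toy length-based scorer are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: one preference pair per benchmark item,
# with keys prompt, chosen, rejected, criterion, language.
Pair = Dict[str, str]


def pairwise_accuracy(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Fraction of pairs in which the chosen response outscores the
    rejected one, grouped by (criterion, language)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p in pairs:
        key = (p["criterion"], p["language"])
        total[key] += 1
        chosen_score = score_fn(p["prompt"], p["chosen"])
        rejected_score = score_fn(p["prompt"], p["rejected"])
        if chosen_score > rejected_score:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


def toy_score(prompt: str, response: str) -> float:
    # Stand-in for a real reward model (assumption): prefers shorter responses.
    return -float(len(response))


# Two made-up pairs; the criterion name is illustrative only.
example_pairs = [
    {
        "prompt": "Sum a list.",
        "chosen": "sum(xs)",
        "rejected": "t=0\nfor x in xs: t+=x",
        "criterion": "readability",
        "language": "python",
    },
    {
        "prompt": "Sum a list.",
        "chosen": "total = 0\nfor x in xs:\n    total += x",
        "rejected": "sum(xs)",
        "criterion": "readability",
        "language": "python",
    },
]

results = pairwise_accuracy(example_pairs, toy_score)
```

Swapping `toy_score` for a call into an actual reward model would give one accuracy figure per (criterion, language) cell, which is one way to see whether a model that scores well on functional correctness holds up on the other criteria.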

Code & Models

Repositories

ineil77/Themis (GitHub)
