CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Guofeng Cui; Pichao Wang; Yang Liu; Zemian Ke; Zhu Liu; Vimal Bhat

arXiv:2501.13927·cs.CL·January 24, 2025

TL;DR

CRPO introduces a novel preference optimization method that combines reward scores with model confidence to enhance data selection, leading to improved machine translation performance and efficiency in large language models and encoder-decoder architectures.

Contribution

CRPO presents a new approach integrating confidence and reward signals for better data selection in preference optimization, outperforming existing methods in machine translation.

Findings

01 CRPO outperforms RS-DPO, RSO, and MBR score in translation accuracy.

02 CRPO improves data efficiency in training.

03 CRPO generalizes well to encoder-decoder models like NLLB.

Abstract

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical…
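The selection idea in the abstract, preferring pairs where a reward model rates one translation higher while the model itself is more confident in the other, can be illustrated with a minimal sketch. The scoring rule below (reward gap plus a weighted log-probability gap) is an assumption made for illustration and is not taken from the paper; the names `Candidate`, `pair_score`, `select_pair`, and `weight` are hypothetical, and the reward and log-probability values are made-up numbers on arbitrary scales.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Candidate:
    text: str
    reward: float   # score from a reward model (higher is better)
    logprob: float  # model log-likelihood of this translation (higher = more confident)


def pair_score(chosen, rejected, weight=1.0):
    """Hypothetical score: reward gap plus how strongly the model favors the worse output."""
    reward_gap = chosen.reward - rejected.reward
    confidence_gap = rejected.logprob - chosen.logprob
    return reward_gap + weight * confidence_gap


def select_pair(candidates, weight=1.0):
    """Return the best-scoring (chosen, rejected) pair, or None if no pair differs in reward."""
    pairs = [(c, r) for c, r in permutations(candidates, 2) if c.reward > r.reward]
    if not pairs:
        return None
    return max(pairs, key=lambda p: pair_score(p[0], p[1], weight))


# Illustrative candidates for one source sentence (values are invented).
candidates = [
    Candidate("A", reward=0.90, logprob=-12.0),
    Candidate("B", reward=0.85, logprob=-4.0),
    Candidate("C", reward=0.40, logprob=-15.0),
]

chosen, rejected = select_pair(candidates)               # uses the confidence term
reward_only = select_pair(candidates, weight=0.0)        # ignores confidence
print(chosen.text, rejected.text)
print(reward_only[0].text, reward_only[1].text)
```

With these invented values, the reward-only rule picks the pair with the largest reward gap (A over C), while adding the confidence term picks A over B, the pair where the model is confident in the lower-reward translation. In practice the two terms sit on different scales, so `weight` would need tuning or normalization.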

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies