Benchmarking GPT-4 against Human Translators: A Comprehensive Evaluation Across Languages, Domains, and Expertise Levels

Jianhao Yan; Pingchuan Yan; Yulong Chen; Jing Li; Xianchao Zhu; Yue Zhang

arXiv:2411.13775·cs.CL·November 22, 2024

TL;DR

This paper systematically compares GPT-4's translation performance with human translators across multiple languages, domains, and expertise levels, revealing GPT-4's strengths and limitations in real-world translation tasks.

Contribution

It provides the first comprehensive evaluation of GPT-4 against human translators across languages, domains, and proficiency levels, highlighting its consistent quality and unique translation patterns.

Findings

01

GPT-4 performs comparably to junior human translators in terms of total errors, but still lags behind senior translators.

02

GPT-4 maintains consistent quality across all evaluated language pairs, including resource-poor directions.

03

Human translators sometimes hallucinate or over-interpret context.

Abstract

This study presents a comprehensive evaluation of GPT-4's translation capabilities compared to human translators of varying expertise levels. Through systematic human evaluation using the MQM schema, we assess translations across three language pairs (Chinese $\leftrightarrow$ English, Russian $\leftrightarrow$ English, and Chinese $\leftrightarrow$ Hindi) and three domains (News, Technology, and Biomedical). Our findings reveal that GPT-4 achieves performance comparable to junior-level translators in terms of total errors, while still lagging behind senior translators. Unlike traditional Neural Machine Translation systems, which show significant performance degradation in resource-poor language directions, GPT-4 maintains consistent translation quality across all evaluated language pairs. Through qualitative analysis, we identify distinctive patterns in translation approaches:…
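
As a rough illustration of what comparing translators "in terms of total errors" under an MQM-style annotation can look like, the sketch below tallies annotated errors per translator. It is not the authors' evaluation code: the record layout, the translator labels, the toy annotations, and the severity weights are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (translator label, error severity).
# These are toy values for illustration, not data from the paper.
ANNOTATIONS = [
    ("translator_a", "minor"),
    ("translator_a", "major"),
    ("translator_b", "minor"),
    ("translator_b", "minor"),
    ("translator_b", "major"),
    ("translator_c", "minor"),
]

# Assumed severity weights for a weighted penalty; the paper's actual
# weighting (if any) is not stated in the abstract.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5}


def tally_errors(records):
    """Count errors per translator, broken down by severity."""
    counts = defaultdict(Counter)
    for translator, severity in records:
        counts[translator][severity] += 1
    return counts


def total_errors(counts):
    """Unweighted number of annotated errors per translator."""
    return {name: sum(c.values()) for name, c in counts.items()}


def weighted_penalty(counts, weights=SEVERITY_WEIGHTS):
    """Severity-weighted penalty per translator (lower is better)."""
    return {
        name: sum(weights[sev] * n for sev, n in c.items())
        for name, c in counts.items()
    }


if __name__ == "__main__":
    counts = tally_errors(ANNOTATIONS)
    print(total_errors(counts))
    print(weighted_penalty(counts))
```

Reporting both the raw count and a weighted penalty keeps the two views separate: two translators can make the same number of errors while differing in how severe those errors are.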

Code & Models

Repositories

elliottyan/gpt_versus_mt_experts
Official

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education

Methods: Attention Is All You Need · Dense Connections · Label Smoothing · Layer Normalization · Dropout · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Softmax