PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

Ishaan Watts; Varun Gumma; Aditya Yadavalli; Vivek Seshadri; Manohar Swaminathan; Sunayana Sitaram

arXiv:2406.15053 · cs.CL · October 21, 2024 · 1 citation

TL;DR

This study evaluates 30 multilingual LLMs across 10 Indic languages using 90K human evaluations and 30K LLM-based evaluations, and reports on model performance, human-LLM agreement, and evaluator biases in multilingual, multi-cultural settings.

Contribution

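It builds leaderboards for two evaluation settings, pairwise comparison and direct assessment, from large-scale human and LLM-based evaluations, and analyzes human-LLM agreement and biases, addressing gaps in the linguistic diversity and cultural nuance of existing benchmarks.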

Findings

01. GPT-4o and Llama-3 70B consistently perform best for most Indic languages.

02. Humans and LLM evaluators agree fairly well in pairwise comparison, but agreement drops in direct assessment.

03. GPT-based evaluators show evidence of self-bias.

Abstract

Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors -- the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyze the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for…
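The abstract refers to measuring agreement between human and LLM evaluators in the pairwise setting. As a rough illustration only, the sketch below computes two common agreement statistics, raw percent agreement and Cohen's kappa, over paired verdicts. The abstract shown here does not state which metrics the authors use, so the choice of metrics, the function names, and the toy verdict labels ("A", "B", "tie") are all assumptions made for this example, not the paper's method or data.

```python
from collections import Counter


def percent_agreement(human, llm):
    """Fraction of items on which the two evaluators gave the same verdict."""
    if len(human) != len(llm) or not human:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def cohen_kappa(human, llm):
    """Chance-corrected agreement between two evaluators (Cohen's kappa)."""
    n = len(human)
    p_observed = percent_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    labels = set(human_counts) | set(llm_counts)
    # Expected agreement if both evaluators labelled independently
    # according to their own marginal label frequencies.
    p_expected = sum(human_counts[c] * llm_counts[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        # Both evaluators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy verdicts for six pairwise comparisons (not data from the paper).
human_verdicts = ["A", "B", "A", "tie", "B", "A"]
llm_verdicts = ["A", "B", "B", "tie", "B", "A"]

print("percent agreement:", percent_agreement(human_verdicts, llm_verdicts))
print("Cohen's kappa:", cohen_kappa(human_verdicts, llm_verdicts))
```

Percent agreement is easy to read but can look high when one verdict dominates. Kappa discounts the agreement expected by chance, which is why the two are often reported together.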

Taxonomy

Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare