SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

Jia Guo; Longxu Dou; Guangtao Zeng; Stanley Kok; Wei Lu; Qian Liu

arXiv:2412.01186·cs.CL·December 3, 2024

TL;DR

SailCompass provides a comprehensive, reproducible benchmark for evaluating Large Language Models on Southeast Asian languages, emphasizing robustness, diverse tasks, and advanced prompting techniques to improve model assessment.

Contribution

This work introduces SailCompass, a new benchmark with diverse datasets and evaluation methods for SEA languages, enhancing reproducibility and robustness in LLM evaluation.

Findings

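1. SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed.

2. A balanced language distribution is important for developing better SEA-specialized LLMs.

3. Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are needed to make better use of LLMs.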

Abstract

In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian (SEA) languages. SailCompass encompasses three main SEA languages and eight primary tasks, comprising 14 datasets that cover three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibration to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) A balanced language distribution is important for developing better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize…
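The abstract names two techniques, perplexity-based ranking and calibration, without showing how they operate. The sketch below illustrates one common form of each; it is not taken from the SailCompass code. It assumes per-token log-probabilities and label probabilities have already been obtained from a model, and it uses contextual calibration (dividing by the probabilities the model assigns to a content-free input such as "N/A") as a stand-in, since the abstract does not say which calibration method the benchmark uses. The function names `perplexity`, `rank_by_perplexity`, and `calibrate` are hypothetical.

```python
import math
from typing import Dict, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity(option_logprobs: Dict[str, Sequence[float]]) -> str:
    """Pick the answer option whose continuation has the lowest perplexity.

    Averaging over tokens keeps longer options from being penalized
    simply for containing more tokens.
    """
    return min(option_logprobs, key=lambda k: perplexity(option_logprobs[k]))


def calibrate(
    label_probs: Dict[str, float],
    content_free_probs: Dict[str, float],
) -> Dict[str, float]:
    """Contextual calibration (assumed variant, not confirmed by the source).

    Each label probability is divided by the probability the model gives
    that label on a content-free input, then the result is renormalized.
    """
    scaled = {lbl: label_probs[lbl] / content_free_probs[lbl] for lbl in label_probs}
    total = sum(scaled.values())
    return {lbl: value / total for lbl, value in scaled.items()}


if __name__ == "__main__":
    # Made-up log-probabilities for three answer options (illustrative only).
    options = {"A": [-2.0, -2.0], "B": [-0.5, -0.5, -0.5], "C": [-1.0]}
    print(rank_by_perplexity(options))

    # Made-up label probabilities; the model is biased toward "pos".
    print(calibrate({"pos": 0.6, "neg": 0.4}, {"pos": 0.8, "neg": 0.2}))
```

In the calibration example, the raw prediction favors "pos", but because the model also favors "pos" on a content-free input, the calibrated distribution shifts toward "neg".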

Code & Models

Repositories

sail-sg/sailcompass (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques