Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios

Millicent Ochieng; Varun Gumma; Sunayana Sitaram; Jindong Wang; Vishrav Chaudhary; Keshet Ronen; Kalika Bali; Jacki O'Neill

arXiv:2406.00343 · cs.CL · June 14, 2024 · 1 cites

TL;DR

This study evaluates the effectiveness of seven leading LLMs in sentiment analysis within multilingual, code-mixed WhatsApp chats, highlighting their strengths and weaknesses in understanding cultural and linguistic nuances in low-resource scenarios.

Contribution

It provides a comprehensive evaluation of LLMs' performance in culturally nuanced, low-resource multilingual settings, emphasizing the need for improved benchmarks and transparency.

Findings

01

GPT-4 and GPT-4-Turbo excelled in understanding diverse linguistic and contextual nuances.

02

Most LLMs struggled with cultural nuances and transparency in decision-making.

03

High-performing models still face challenges in low-resource, culturally nuanced environments.

Abstract

The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings. This research evaluates the performance of seven leading LLMs in sentiment analysis on a dataset derived from multilingual and code-mixed WhatsApp chats, including Swahili, English and Sheng. Our evaluation includes both quantitative analysis using metrics like F1 score and qualitative assessment of LLMs' explanations for their predictions. We find that, while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled to understand linguistic and contextual nuances, and their explanations revealed a lack of transparency in their decision-making process. In contrast, GPT-4 and GPT-4-Turbo excelled in grasping…
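As a rough illustration of the quantitative side of such an evaluation, the sketch below computes per-class and macro-averaged F1 for a sentiment classifier. It is not the authors' evaluation code: the three-way label set, the choice of macro averaging, and the toy `gold` and `pred` lists are all assumptions made for the example.

```python
from collections import Counter

# Assumed label set for illustration; the abstract does not list the labels.
LABELS = ("negative", "neutral", "positive")


def per_class_f1(gold, pred, labels=LABELS):
    """Return a dict mapping each label to its F1 score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


# Hypothetical gold annotations and model predictions for five messages.
gold = ["positive", "negative", "neutral", "positive", "negative"]
pred = ["positive", "neutral", "neutral", "negative", "negative"]

print(per_class_f1(gold, pred))
print(round(macro_f1(gold, pred), 4))
```

A single aggregate score like this says nothing about why a model chose a label, which is the gap the qualitative review of model explanations is meant to address.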

