AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark
Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O'Brien, Kevin Zhu

TL;DR
This paper introduces AAVENUE, a benchmark for evaluating large language models' performance on NLU tasks in AAVE versus SAE, revealing biases and emphasizing the need for more inclusive NLP models.
Contribution
The paper presents AAVENUE, a novel benchmark using LLM-based translation for assessing biases in LLMs on AAVE and SAE NLU tasks, extending existing benchmarks with a flexible methodology.
Findings
LLMs perform better on SAE than AAVE tasks, indicating biases.
AAVENUE's LLM-based translations score better than VALUE's rule-based translations across the paper's evaluation metrics.
Fluent AAVE speakers validated the authenticity of the translations.
Abstract
Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including…
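The comparison the abstract describes, running the same NLU task once in SAE and once in AAVE and checking for a performance discrepancy, can be illustrated with a minimal sketch. The function names (`accuracy`, `dialect_gap`) and the toy predictions below are hypothetical and chosen only for illustration; they are not taken from the paper, and accuracy stands in here for whatever task metrics the authors actually report.

```python
from typing import Dict, List


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def dialect_gap(
    sae_preds: List[int], aave_preds: List[int], labels: List[int]
) -> Dict[str, float]:
    """Compare accuracy on SAE inputs against their AAVE translations.

    Both prediction lists are assumed to cover the same examples in the
    same order, so they share one list of gold labels.
    """
    sae_acc = accuracy(sae_preds, labels)
    aave_acc = accuracy(aave_preds, labels)
    # A positive gap means the model did better on the SAE version.
    return {"sae": sae_acc, "aave": aave_acc, "gap": sae_acc - aave_acc}


# Toy, made-up data for a binary task (not results from the paper).
labels = [1, 0, 1, 1]
sae_preds = [1, 0, 1, 0]
aave_preds = [1, 1, 0, 0]

result = dialect_gap(sae_preds, aave_preds, labels)
print(result)
```

Because the AAVE inputs are translations of the SAE inputs, the gold labels are shared, and any gap can be attributed to the change in dialect rather than to a change in task content.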
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
