RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

Yuqing Wang; Yun Zhao

arXiv:2406.11020·cs.CL·June 18, 2024

TL;DR

This paper introduces RUPBench, a comprehensive benchmark for evaluating the robustness of large language models across diverse reasoning tasks and textual perturbations, revealing model strengths and weaknesses.

Contribution

RUPBench is the first benchmark to systematically assess LLM robustness across multiple reasoning types and perturbation levels, providing detailed analysis of model error patterns.

Findings

01. Larger models show greater robustness to perturbations.

02. Common errors include logical inconsistencies and lexical misunderstandings.

03. Performance drops significantly under certain perturbations.

Abstract

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly impacting their effectiveness in practical applications. To systematically understand the robustness of LLMs, we present RUPBench, a comprehensive benchmark designed to evaluate LLM robustness across diverse reasoning tasks. Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning, and introduces nine types of textual perturbations at lexical, syntactic, and semantic levels. By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.…
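The comparison of original and perturbed datasets described in the abstract can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not taken from the RUPBench code: `swap_adjacent_chars` is a toy lexical-level perturbation chosen for illustration (the abstract does not list the nine perturbation types), `relative_drop` is one plausible way to summarize robustness, and `lookup_model` is a dummy model that answers only questions it has seen verbatim.

```python
import random
from typing import Callable, Sequence


def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Toy lexical perturbation (illustrative only): for a fraction `rate`
    of words longer than three characters, swap two adjacent inner letters."""
    rng = random.Random(seed)
    perturbed = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # keep first and last char fixed
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)


def accuracy(model: Callable[[str], str],
             questions: Sequence[str],
             answers: Sequence[str]) -> float:
    """Fraction of questions for which the model output equals the gold answer."""
    correct = sum(model(q) == a for q, a in zip(questions, answers))
    return correct / len(answers)


def relative_drop(original_acc: float, perturbed_acc: float) -> float:
    """Relative accuracy drop in percent (an assumed summary metric)."""
    if original_acc == 0:
        return 0.0
    return (original_acc - perturbed_acc) / original_acc * 100


# Hypothetical toy data and model, for demonstration only.
questions = ["What is two plus three?", "What color is the clear daytime sky?"]
answers = ["five", "blue"]
memory = dict(zip(questions, answers))


def lookup_model(question: str) -> str:
    # Brittle by design: any change to the question text breaks the lookup.
    return memory.get(question, "")


perturbed_questions = [swap_adjacent_chars(q, rate=1.0) for q in questions]
original_acc = accuracy(lookup_model, questions, answers)
perturbed_acc = accuracy(lookup_model, perturbed_questions, answers)
print(f"original={original_acc:.2f} perturbed={perturbed_acc:.2f} "
      f"drop={relative_drop(original_acc, perturbed_acc):.1f}%")
```

In a real evaluation the dummy model would be replaced by calls to an LLM, and exact string matching by the answer-extraction procedure appropriate to each dataset.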

Code & Models

Repositories

eternityyw/rupbench (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques