NLPBench: Evaluating Large Language Models on Solving NLP Problems

Linxin Song; Jieyu Zhang; Lechao Cheng; Pengyuan Zhou; Tianyi Zhou; Irene Li

arXiv:2309.15630 · cs.CL · October 20, 2023 · 1 citation

TL;DR

This paper introduces NLPBench, a comprehensive benchmark dataset for evaluating large language models on NLP problem-solving tasks, revealing varied performance and prompting strategy effects.

Contribution

We created NLPBench, a novel dataset with diverse NLP questions, and systematically evaluated LLMs using advanced prompting techniques to identify strengths and weaknesses.

Findings

01

Advanced prompts can sometimes reduce LLM performance.

02

Smaller models like LLAMA-2 struggle with logical reasoning.

03

LLMs show specific weaknesses in scientific problem-solving.

Abstract

Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, there remains a dearth of research dedicated to the NLP problem-solving abilities of LLMs. To fill the gap in this area, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions spanning various NLP topics sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of the advanced prompting strategies…
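To make the "questions with context" setup concrete, the following is a minimal sketch of how a group of sub-questions sharing one context might be represented and turned into prompts, with an optional zero-shot chain-of-thought trigger. The class names, field names, and prompt layout are assumptions made for illustration; they are not NLPBench's actual schema or the authors' prompt templates.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical schema: names are illustrative, not NLPBench's real format.
@dataclass
class Question:
    text: str
    qtype: str  # e.g. "multiple_choice", "short_answer", or "math"
    choices: List[str] = field(default_factory=list)

@dataclass
class ContextGroup:
    context: str  # information shared by every sub-question in the group
    questions: List[Question]

# A widely used zero-shot CoT trigger phrase (assumed here, not taken from the paper).
COT_SUFFIX = "Let's think step by step."

def build_prompt(q: Question, context: Optional[str] = None,
                 use_cot: bool = False) -> str:
    """Assemble a plain-text prompt for one question."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Question: {q.text}")
    # Label multiple-choice options (A), (B), ...
    for i, choice in enumerate(q.choices):
        parts.append(f"({chr(ord('A') + i)}) {choice}")
    parts.append(f"Answer: {COT_SUFFIX}" if use_cot else "Answer:")
    return "\n".join(parts)

def prompts_for_group(group: ContextGroup, use_cot: bool = False) -> List[str]:
    """Repeat the shared context in the prompt of every sub-question."""
    return [build_prompt(q, group.context, use_cot) for q in group.questions]
```

Repeating the shared context in each sub-question's prompt keeps every prompt self-contained, at the cost of longer inputs; whether the benchmark's own evaluation does this is not stated in the abstract.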

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

* Interesting angle and scope for LLM evaluation
* Experiments on the effectiveness of different prompting strategies for this task

Weaknesses

* The scope of the evaluation is quite limited
* The dataset is small, and the evaluation cost is relatively high

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

- This work aims to expand the scope of LLM benchmarking and address new perspectives by contributing new datasets and benchmarking scenarios.
- The evaluation includes both closed and open models. It is important to understand the gap between these model types.

Weaknesses

- The dataset is relatively small, which might limit the conclusions that can be drawn from the results.
- The gap filled by introducing NLPBench (Table 6) is rather narrow and needs stronger justification. The benchmark also covers a broader context, such as math, than its claimed main focus on NLP-related topics.
- The conclusions of the analysis are mostly known, such as small LLMs giving inconsistent results with advanced prompting and LLMs being limited on problem-solving tasks.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. Introduces a new dataset that challenges the current state-of-the-art prompting strategies and is useful for evaluating the performance of LLMs.
2. The paper is well-written and easy to follow.
3. The authors carried out extensive experiments and evaluations that included recent prompting approaches and LLMs.

Weaknesses

In general, NLPBench is useful, but some details about how it was collected are missing:
1. How did you get access to the final exams included in the dataset? How do you ensure that these exams were not already online and that recent models like GPT-4 have not already included them in their training data? It was not clear from the paper how you checked that. You mentioned that you curate questions that are not readily accessible online and couldn’t be easily extracte

Code & Models

Repositories

linxins97/nlpbench · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Attention Is All You Need · Cosine Annealing · Residual Connection · Layer Normalization · Byte Pair Encoding · Dense Connections · Linear Warmup With Cosine Annealing · Linear Layer