NLPBench: Evaluating Large Language Models on Solving NLP Problems
Linxin Song, Jieyu Zhang, Lechao Cheng, Pengyuan Zhou, Tianyi Zhou, Irene Li

TL;DR
This paper introduces NLPBench, a comprehensive benchmark dataset for evaluating large language models on NLP problem-solving tasks, revealing varied performance and prompting strategy effects.
Contribution
We created NLPBench, a novel dataset with diverse NLP questions, and systematically evaluated LLMs using advanced prompting techniques to identify strengths and weaknesses.
Findings
Advanced prompts can sometimes reduce LLM performance.
Smaller models like LLAMA-2 struggle with logical reasoning.
LLMs show specific weaknesses in scientific problem-solving.
Abstract
Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, there remains a dearth of research dedicated to the NLP problem-solving abilities of LLMs. To fill the gap in this area, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions spanning various NLP topics sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of the advanced prompting strategies…
Peer Reviews
Decision · Submitted to ICLR 2024
* Interesting angle and scope for LLM evaluation * Experiments on the effectiveness of different prompting strategies for this task
* The scope of the evaluation is quite limited * The dataset is small, and the evaluation cost is relatively high
- This work aims to expand the scope of LLM benchmarking and address new perspectives by contributing new datasets and benchmarking scenarios - The evaluation includes both closed and open models. It is important to understand the gap between these model types
- The dataset is relatively small, which might limit the conclusions drawn from the results. - The gap filled by introducing NLPBench (Table 6) is rather narrow and needs stronger justification. It covers a broader context than the claimed main focus of NLP-related topics, such as Math. - The conclusions of the analysis are mostly known, such as small LLMs having inconsistent results with advanced prompting and LLM limitations on problem-solving tasks.
1. Introduces a new dataset that challenges the current state-of-the-art prompting strategies and which is useful in evaluating the performance of LLMs. 2. The paper is well-written and easy to follow. 3. The authors carried out extensive experiments and evaluations that included recent prompting approaches and LLMs.
In general, NLPBench is useful, but I think some details on how it was collected are missing: 1. How did you get access to the final exams included in the dataset? How do you ensure that these exams were not already online and that some recent models like GPT-4 have not already included them in their training data? It was not clear from the paper how you checked that. You mentioned that you curate questions that are not readily accessible online and couldn’t be easily extracte
Taxonomy
Topics
Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods
Attention Is All You Need · Cosine Annealing · Residual Connection · Layer Normalization · Byte Pair Encoding · Dense Connections · Linear Warmup With Cosine Annealing · Linear Layer
