See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Yulong Chen; Yang Liu; Jianhao Yan; Xuefeng Bai; Ming Zhong; Yinghao Yang; Ziyi Yang; Chenguang Zhu; Yue Zhang

arXiv:2408.08978 · cs.CL · October 2, 2024 · 2 citations

TL;DR

This paper introduces a Self-Challenge framework enabling LLMs to identify their own weaknesses by generating challenging instances, leading to a new benchmark that reveals the limitations of current models like GPT-4, Claude-3, and Llama-3.

Contribution

The paper presents a novel self-assessment approach for LLMs to discover their limitations and creates a challenging benchmark, SC-G4, for evaluating LLMs' ability to handle difficult instances.

Findings

01. GPT-4 correctly answers only 44.96% of SC-G4 instances.

02. The error patterns also challenge other LLMs, such as Claude-3 and Llama-3, not only GPT-4.

03. Fine-tuning does not fully resolve the identified weaknesses.

Abstract

The impressive performance of Large Language Models (LLMs) has consistently surpassed numerous human-designed benchmarks, presenting new challenges in assessing the shortcomings of LLMs. Designing tasks and finding LLMs' limitations are becoming increasingly important. In this paper, we investigate the question of whether an LLM can discover its own limitations from the errors it makes. To this end, we propose a Self-Challenge evaluation framework with human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances and incorporate human feedback on them to refine these patterns for generating more challenging data, iteratively. We end up with 8 diverse patterns, such as text manipulation and questions with assumptions. We then build a benchmark, SC-G4, consisting of 1,835 instances…
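The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `self_challenge` and its parameters `summarize`, `generate`, `is_failure`, `refine`, and `rounds` are all hypothetical names. In the paper's setting, `summarize` and `generate` would be GPT-4 prompts, and `refine` would incorporate human feedback on the generated instances.

```python
from typing import Callable, List


def self_challenge(
    seeds: List[str],
    summarize: Callable[[List[str]], List[str]],
    generate: Callable[[str], List[str]],
    is_failure: Callable[[str], bool],
    refine: Callable[[str, List[str]], str],
    rounds: int = 3,
) -> List[str]:
    """Hypothetical sketch of the Self-Challenge loop.

    seeds      -- seed instances the model fails to answer
    summarize  -- turns failed instances into error patterns (an LLM call, assumed)
    generate   -- produces new instances from one pattern (an LLM call, assumed)
    is_failure -- True if the model answers the instance incorrectly (assumed)
    refine     -- revises a pattern given the instances that were still hard;
                  this is where human feedback would enter (assumed)
    """
    # Step 1: summarize error patterns from the seed failures.
    patterns = summarize(seeds)

    # Step 2: iteratively generate instances and refine each pattern.
    for _ in range(rounds):
        refined = []
        for pattern in patterns:
            candidates = generate(pattern)
            # Keep only instances that still defeat the model.
            hard = [q for q in candidates if is_failure(q)]
            refined.append(refine(pattern, hard))
        patterns = refined

    return patterns
```

Passing the model calls in as plain callables keeps the sketch independent of any particular LLM API; stubs can stand in for them when testing the control flow.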

Code & Models

Repositories

cylnlp/Self-Challenge-GPT4 (Official)

Taxonomy

Topics: Law, AI, and Intellectual Property · Artificial Intelligence in Law

Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax