See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
Yulong Chen, Yang Liu, Jianhao Yan, Xuefeng Bai, Ming Zhong, Yinghao Yang, Ziyi Yang, Chenguang Zhu, Yue Zhang

TL;DR
This paper introduces a Self-Challenge framework enabling LLMs to identify their own weaknesses by generating challenging instances, leading to a new benchmark that reveals the limitations of current models like GPT-4, Claude-3, and Llama-3.
Contribution
The paper presents a novel self-assessment approach for LLMs to discover their limitations and creates a challenging benchmark, SC-G4, for evaluating LLMs' ability to handle difficult instances.
Findings
GPT-4 correctly answers 44.96% of SC-G4 instances.
The error patterns found for GPT-4 also challenge other LLMs, such as Claude-3 and Llama-3.
Fine-tuning does not fully resolve identified weaknesses.
Abstract
The impressive performance of Large Language Models (LLMs) has consistently surpassed numerous human-designed benchmarks, presenting new challenges in assessing the shortcomings of LLMs. Designing tasks and finding LLMs' limitations are becoming increasingly important. In this paper, we investigate the question of whether an LLM can discover its own limitations from the errors it makes. To this end, we propose a Self-Challenge evaluation framework with human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances and incorporate human feedback on them to refine these patterns for generating more challenging data, iteratively. We end up with 8 diverse patterns, such as text manipulation and questions with assumptions. We then build a benchmark, SC-G4, consisting of 1,835 instances…
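The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Pattern`, `self_challenge`, `summarize`, `generate`, `is_answered_correctly`, and `refine_with_feedback` are all hypothetical, the model calls and the human feedback step are passed in as plain callables, and the stopping rule (stop once every generated instance is failed) is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Pattern:
    """An error pattern in natural language (hypothetical container)."""
    description: str
    history: List[str] = field(default_factory=list)  # earlier wordings


def self_challenge(
    seed_failures: List[str],
    summarize: Callable[[List[str]], Pattern],
    generate: Callable[[Pattern], List[str]],
    is_answered_correctly: Callable[[str], bool],
    refine_with_feedback: Callable[[Pattern, List[str]], Pattern],
    max_rounds: int = 3,
) -> Tuple[Pattern, List[str]]:
    """Refine one error pattern until its generated instances are hard.

    All callables stand in for steps the abstract names but does not
    specify: prompting the model to summarize failures, prompting it to
    generate new instances, checking its answers, and collecting human
    feedback to rewrite the pattern.
    """
    # Step 1: summarize the seed failures into an initial pattern.
    pattern = summarize(seed_failures)
    hard: List[str] = []

    for _ in range(max_rounds):
        # Step 2: generate new instances from the current pattern.
        candidates = generate(pattern)
        # Step 3: keep the instances the model still fails on.
        hard = [q for q in candidates if not is_answered_correctly(q)]
        # Assumed stopping rule: every generated instance is challenging.
        if len(hard) == len(candidates):
            break
        # Step 4: human feedback refines the pattern for the next round.
        refined = refine_with_feedback(pattern, hard)
        refined.history = pattern.history + [pattern.description]
        pattern = refined

    return pattern, hard
```

In this sketch the human-in-the-loop step is just another callable, so it can be replaced by an interactive prompt or a stored annotation file without changing the loop itself.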
Taxonomy
Topics: Law, AI, and Intellectual Property · Artificial Intelligence in Law
Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax
