Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

Jiayi Zeng; Yizhe Feng; Mengliang He; Wenhui Lei; Wei Zhang; Zeming Liu; Xiaoming Shi; Aimin Zhou

arXiv:2506.00064 · cs.CL · June 3, 2025

TL;DR

This paper introduces Mis-prompt, a benchmark for evaluating large language models' ability to handle errors proactively without explicit instructions, revealing current limitations and potential improvements through supervised fine-tuning.

Contribution

The paper presents a new benchmark for proactive error handling in LLMs, comprising four evaluation tasks, an error category taxonomy, and a new evaluation dataset, addressing a gap in current error-handling research.

Findings

01 Current LLMs perform poorly on proactive error handling.

02 Supervised fine-tuning improves LLMs' error handling capabilities.

03 The dataset will be publicly available for further research.

Abstract

Large language models (LLMs) have demonstrated significant advancements in error handling. Existing error-handling work is passive: it relies on explicit error-handling instructions. In real-world scenarios, however, such instructions are usually unavailable. This paper frames the resulting challenge as proactive error handling, that is, handling errors without explicit error-handling instructions. To promote further research, it introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. It also analyzes current LLMs' performance on the benchmark. The experimental results reveal that current LLMs perform poorly on proactive error handling and that supervised fine-tuning (SFT) on error-handling instances improves their proactive error-handling capabilities. The dataset…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research