CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
Chuyu Zhang, Songyang Zhang, Yingfan Hu, Haowen Shen, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai Chen

TL;DR
CIBench is an interactive evaluation framework designed to assess large language models' ability to utilize code interpreters for data science tasks, providing a comprehensive benchmark with real-world workflow simulation.
Contribution
This work introduces CIBench, a benchmarking framework comprising an evaluation dataset and two evaluation modes, specifically targeting LLMs' use of code interpreters in data science workflows.
Findings
24 LLMs evaluated on CIBench with detailed performance analysis
Insights into strengths and limitations of LLMs in code interpreter tasks
Benchmarking results guide future development of LLMs for data science
Abstract
While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability is challenging, thereby hindering a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. Our evaluation framework includes an evaluation dataset and two evaluation modes. The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions. The two evaluation modes assess LLMs' ability with and without human assistance. We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
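To make the idea of consecutive steps and two evaluation modes concrete, here is a minimal sketch, not the CIBench implementation. It assumes that "human assistance" means replaying a reference solution after a failed step so that later steps start from a correct state; the abstract does not spell out the mechanism. The names run_session, toy_model, and STEPS are hypothetical, and a shared Python dict stands in for a persistent IPython session.

```python
from typing import Callable, Dict, List

# Hypothetical two-step task; each step depends on state left by the previous one.
STEPS: List[Dict] = [
    {"prompt": "create the list xs", "reference": "xs = [1, 2, 3]",
     "var": "xs", "expected": [1, 2, 3]},
    {"prompt": "sum xs into total", "reference": "total = sum(xs)",
     "var": "total", "expected": 6},
]


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: emits broken code for step 1, correct code for step 2."""
    if "create" in prompt:
        return "xs = [1, 2"  # deliberately malformed
    return "total = sum(xs)"


def run_session(steps: List[Dict], model: Callable[[str], str],
                assisted: bool) -> List[bool]:
    """Run all steps in one shared namespace and return a pass flag per step."""
    namespace: Dict = {}  # persists across steps, like an interactive session
    passed = []
    for step in steps:
        try:
            exec(model(step["prompt"]), namespace)
            ok = namespace.get(step["var"]) == step["expected"]
        except Exception:
            ok = False
        passed.append(ok)
        if assisted and not ok:
            # Assumed form of assistance: replay the reference code for this step.
            exec(step["reference"], namespace)
    return passed


if __name__ == "__main__":
    print(run_session(STEPS, toy_model, assisted=False))
    print(run_session(STEPS, toy_model, assisted=True))
```

In this toy setup the unassisted run fails both steps, because the error in step 1 leaves xs undefined for step 2, while the assisted run recovers and passes step 2. That difference is the kind of error propagation a comparison between the two modes can expose.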
Taxonomy
Topics: Natural Language Processing Techniques · Library Science and Information Systems
