CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

Chuyu Zhang; Songyang Zhang; Yingfan Hu; Haowen Shen; Kuikun Liu; Zerun Ma; Fengzhe Zhou; Wenwei Zhang; Xuming He; Dahua Lin; Kai Chen

arXiv:2407.10499 · cs.CL · November 7, 2024 · 1 citation

TL;DR

CIBench is an interactive evaluation framework that assesses how well large language models use a code interpreter for data science tasks. Its benchmark simulates a real workflow through consecutive, interactive IPython sessions.

Contribution

This work introduces CIBench, a benchmarking framework made up of an evaluation dataset and two evaluation modes, targeting LLMs' use of code interpreters in data science workflows.

Findings

01 24 LLMs evaluated on CIBench with detailed performance analysis

02 Insights into strengths and limitations of LLMs in code interpreter tasks

03 Benchmarking results guide future development of LLMs for data science

Abstract

While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability remains challenging, which hinders a clear understanding of their limitations. In this paper, we propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks. Our evaluation framework includes an evaluation dataset and two evaluation modes. The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions. The two evaluation modes assess LLMs' ability with and without human assistance. We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
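
The difference between the two evaluation modes can be illustrated with a toy sketch. This is not CIBench's implementation: the task format, the `run_session` helper, the `toy_model` stub, and the choice to inject reference code after a failed step are all assumptions made for illustration, and a plain Python dictionary stands in for a persistent IPython session. The point it shows is that steps share state, so an early mistake can propagate unless assistance corrects it.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task format (an assumption, not CIBench's schema):
# (prompt, reference code, check run against the shared session state).
Step = Tuple[str, str, Callable[[Dict], bool]]

STEPS: List[Step] = [
    ("Create a list `data` holding 1 to 5.",
     "data = [1, 2, 3, 4, 5]",
     lambda ns: ns.get("data") == [1, 2, 3, 4, 5]),
    ("Store the sum of `data` in `total`.",
     "total = sum(data)",
     lambda ns: ns.get("total") == 15),
    ("Store the mean of `data` in `mean`.",
     "mean = total / len(data)",
     lambda ns: ns.get("mean") == 3.0),
]


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: gets the first step wrong, the rest right."""
    if "Create a list" in prompt:
        return "data = [1, 2, 3, 4]"  # deliberate mistake
    if "sum" in prompt:
        return "total = sum(data)"
    return "mean = total / len(data)"


def run_session(steps: List[Step], model: Callable[[str], str],
                assisted: bool) -> List[bool]:
    """Run all steps in one shared namespace and return per-step pass/fail."""
    namespace: Dict = {}  # persists across steps, like an interactive session
    results: List[bool] = []
    for prompt, reference_code, check in steps:
        try:
            exec(model(prompt), namespace)
            passed = check(namespace)
        except Exception:
            passed = False
        results.append(passed)
        if assisted and not passed:
            # Assumed form of human assistance: repair the session state
            # with the reference code so later steps start from a good state.
            exec(reference_code, namespace)
    return results


unassisted = run_session(STEPS, toy_model, assisted=False)
with_help = run_session(STEPS, toy_model, assisted=True)
print(unassisted)  # [False, False, False]
print(with_help)   # [False, True, True]
```

In this sketch the unassisted run fails every step because the wrong `data` carries forward, while the assisted run fails only the first step. Under these assumptions, comparing the two scores separates a model's ability on an individual step from its ability to complete a whole workflow on its own.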

Code & Models

Repositories

open-compass/CIBench (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Library Science and Information Systems