Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu; Yi Zhong; Jintian Zhang; Ziheng Zhang; Shuofei Qiao; Yujie Luo; Lun Du; Da Zheng; Ningyu Zhang; Huajun Chen

arXiv:2506.19794 · cs.CL · November 14, 2025

TL;DR

This paper systematically evaluates the data analysis abilities of open-source LLMs, identifies the key factors affecting their performance, and proposes a data synthesis method to improve their reasoning on complex analysis tasks.

Contribution

It introduces a comprehensive evaluation framework and a novel data synthesis approach to enhance open-source LLMs' data analysis capabilities.

Findings

01. Strategic planning quality is the main performance factor.

02. Interaction design and task complexity affect reasoning.

03. Data quality impacts performance more than data diversity.

Abstract

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source…

Code & Models

Datasets

zjunlp/DataMind-Analysis-SFT-Data
dataset · 57 downloads

Videos

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Taxonomy

Topics: Open Education and E-Learning · Open Source Software Innovations