DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

He Wang; Alexander Hanbo Li; Yiqun Hu; Sheng Zhang; Hideo Kobayashi; Jiani Zhang; Henry Zhu; Chung-Wei Hang; Patrick Ng

arXiv:2505.14163·cs.AI·May 21, 2025



Open Access · 3 Reviews

TL;DR

DSMentor introduces a curriculum learning framework for LLM data science agents, organizing tasks by difficulty and utilizing long-term memory to improve performance and reasoning on complex problems.

Contribution

This work presents a novel inference-time optimization method that applies curriculum learning and knowledge accumulation to enhance LLM agent effectiveness in data science tasks.

Findings

01

Up to 5.2% improvement in pass rate on the DSEval and QRData benchmarks.

02

8.8% higher pass rate on causality problems compared to GPT-4.

03

Effective knowledge retention enhances problem-solving ability.

Abstract

Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning -- a strategy that introduces simpler tasks first and progressively moves to more complex ones as the learner improves -- to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent's learning progression and enabling more…
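The loop the abstract describes (order tasks from easy to hard, then solve each one with access to a memory that grows as tasks are completed) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Task`, `MemoryEntry`, `run_curriculum`, and `toy_solver` are hypothetical names, the difficulty scores are assumed to have been assigned by the mentor beforehand, and the solver is a stand-in for an LLM agent call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    question: str
    difficulty: float  # hypothetical mentor-assigned score; higher means harder


@dataclass
class MemoryEntry:
    question: str
    answer: str


def run_curriculum(
    tasks: List[Task],
    solve: Callable[[Task, List[MemoryEntry]], str],
) -> List[MemoryEntry]:
    """Solve tasks in order of increasing difficulty, accumulating memory."""
    memory: List[MemoryEntry] = []
    for task in sorted(tasks, key=lambda t: t.difficulty):
        # The solver sees only what was accumulated from earlier (easier) tasks.
        answer = solve(task, list(memory))
        memory.append(MemoryEntry(task.question, answer))
    return memory


def toy_solver(task: Task, memory: List[MemoryEntry]) -> str:
    # Stand-in for an LLM agent; it only reports how much memory it was given.
    return f"solved with {len(memory)} prior examples"


tasks = [
    Task("fit a causal model", 4.0),
    Task("load a CSV", 1.0),
    Task("group and aggregate", 2.5),
]
memory = run_curriculum(tasks, toy_solver)
for entry in memory:
    print(entry.question, "->", entry.answer)
```

In this sketch the hardest task is solved last, so it has the most accumulated experience available to it.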

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

(1) The paper demonstrates the outstanding performance of DSMentor, which utilizes an easy-to-hard curriculum, across established benchmarks in data analysis and causal reasoning tasks. (2) The paper shows that by retrieving easier and more relevant examples from memory and structuring the knowledge in an increasing-similarity order, DSMentor significantly improves data science agents' understanding. (3) The paper demonstrates how DSMentor incrementally introduces tasks with increasing compl
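The retrieval step mentioned in point (2), which fetches relevant examples from memory and presents them in increasing-similarity order, might look roughly like the sketch below. It is an assumption-laden illustration: token-overlap (Jaccard) similarity is swapped in as a simple stand-in for whatever similarity measure the paper uses, and `jaccard` and `retrieve` are hypothetical helper names.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for an embedding-based score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(query: str, memory: List[str], k: int = 2) -> List[str]:
    """Return the k most similar past questions, least similar first."""
    top_k = sorted(memory, key=lambda q: jaccard(query, q), reverse=True)[:k]
    # Reverse so the most similar example comes last, closest to the query.
    return list(reversed(top_k))


past_questions = [
    "load a csv file",
    "compute the mean of a column",
    "plot a histogram of a column",
]
query = "compute the median of a column"
print(retrieve(query, past_questions, k=2))
```

Placing the most similar example last keeps it adjacent to the new question when the retrieved examples are concatenated into a prompt.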

Weaknesses

(1) While the paper presents promising results, it would be more compelling if the experiments were repeated multiple times to establish a mean performance. This would provide a clearer picture of the consistency and reliability of the proposed method, DSMentor, across different runs and potentially highlight any variability in its performance. (2) A more comprehensive exploration of the shortcomings, including the reasons behind them and possible mitigation strategies, would strengthen the pap
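The repeated-runs reporting suggested in point (1) amounts to computing a pass rate per run and then a mean and standard deviation across runs. The sketch below shows one way to do that; the outcome lists are made-up placeholders for illustration only, not results from the paper, and `pass_rate` and `summarize_runs` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import List, Tuple


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of problems passed in a single run."""
    return sum(outcomes) / len(outcomes)


def summarize_runs(runs: List[List[bool]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of pass rate across runs."""
    rates = [pass_rate(r) for r in runs]
    return mean(rates), stdev(rates)


# Placeholder pass/fail outcomes for three hypothetical runs (not real data).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
avg, spread = summarize_runs(runs)
print(f"pass rate: {avg:.2f} +/- {spread:.2f}")
```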

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. DSMentor introduces an inference-time curriculum learning strategy that effectively mirrors human learning, gradually increasing task difficulty to improve the model’s problem-solving abilities. 2. The paper provides thorough experimentation across multiple benchmarks (DSEval, QRData), demonstrating consistent improvements in both general data science tasks and causal reasoning problems.

Weaknesses

1. DSMentor’s performance may degrade when no similar questions exist in memory, and it lacks a clear strategy to handle novel tasks without prior examples. Additionally, as memory grows, retrieval efficiency may become an issue, but the paper doesn’t address how to manage this scalability challenge effectively. 2. The mentor agent’s ability to assign consistent difficulty ratings across diverse tasks is questionable. The generality of the Difficulty Scale Guidelines may lead to inaccurate asse

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The paper demonstrates that the method generally outperforms several appropriate baselines and shows that the design is well-motivated through comprehensive ablations. Overall, the experiments are thorough and their presentation is clear. Using learning curricula in the context of data science problems is well-motivated, given the way that complex tasks can be decomposed into simpler ones. The method is simple and appears to be effective.

Weaknesses

AIDE [1] is a leading baseline for data science LLM agents that has not been considered. In-context curricula for LLMs have been used in several other settings [2, 3], which calls the novelty of this work into question, especially given that nothing about the method seems specific to data science besides the difficulty scale guidelines. [1] D. Schmidt, Z. Jiang and Y. Wu, AIDE: Human-Level Performance in Data Science Competitions, weco.ai, url: https://www.weco.ai/blog/technical-report, 2024


Taxonomy

Topics: Online Learning and Analytics · Big Data and Business Intelligence

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Byte Pair Encoding