ShapleyPipe: Hierarchical Shapley Search for Data Preparation Pipeline Construction
Jing Chang, Chang Liu, Jinbin Huang, Shuyuan Zheng, Rui Mao, Jianbin Qin

TL;DR
ShapleyPipe introduces a hierarchical, game-theoretic approach to automating and interpreting data preparation pipeline construction, reducing search complexity from exponential to polynomial and outperforming existing methods.
Contribution
It presents a novel hierarchical framework using Shapley values for interpretable, efficient data pipeline search, with new mechanisms for tractable Shapley computation.
Findings
Achieves 98.1% of high-budget baseline performance with 24% fewer evaluations.
Outperforms state-of-the-art reinforcement learning methods by 3.6%.
Provides highly interpretable operator valuations with 0.933 correlation to empirical performance.
Abstract
Automated data preparation pipeline construction is critical for machine learning success, yet existing methods suffer from two fundamental limitations: they treat pipeline construction as black-box optimization without quantifying individual operator contributions, and they struggle with the combinatorial explosion of the search space ($N^M$ configurations for N operators and pipeline length M). We introduce ShapleyPipe, a principled framework that leverages game-theoretic Shapley values to systematically quantify each operator's marginal contribution while maintaining full interpretability. Our key innovation is a hierarchical decomposition that separates category-level structure search from operator-level refinement, reducing the search complexity from exponential to polynomial. To make Shapley computation tractable, we develop: (1) a Multi-Armed Bandit mechanism for intelligent…
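To make the idea of an operator's "marginal contribution" concrete, the sketch below computes exact Shapley values for a three-operator toy problem. It is an illustration only, not the paper's method: the operator names, the `toy_score` function, and its numbers are hypothetical, coalitions are treated as unordered sets (real pipelines are ordered), and exact enumeration is used in place of the approximation mechanisms the abstract mentions.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition.

    players: list of hashable operator names.
    value:   function mapping a frozenset of players to a numeric score.
    Cost grows as 2^n, so this is only feasible for a handful of players.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when added to coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical per-operator gains over a 0.60 baseline score (assumption).
BASE = {"impute": 0.10, "scale": 0.05, "encode": 0.02}


def toy_score(ops):
    """Hypothetical downstream score for an unordered set of operators."""
    score = 0.60 + sum(BASE[o] for o in ops)
    # Assumed interaction: imputation and scaling help more together.
    if "impute" in ops and "scale" in ops:
        score += 0.04
    return score


phi = shapley_values(list(BASE), toy_score)
for op, v in phi.items():
    print(f"{op}: {v:.3f}")
```

In this toy setting the interaction bonus is split evenly between the two operators that produce it, and the values sum to the gain of the full operator set over the empty one, which is the efficiency property that makes Shapley values attractive for attribution.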
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms
