TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu, Guoqing Li, Anuj Pathania, Andy D. Pimentel, and Meng Zhang

TL;DR
TCL is a compiler framework that uses continual learning and active data selection to optimize tensor programs efficiently across hardware platforms, reducing data collection costs and improving cross-platform transferability.
Contribution
TCL introduces three core components: the RDU Sampler (an active learning sampler), a lightweight Mamba-based cost model, and a knowledge distillation framework for cross-platform transfer learning.
Findings
Achieves 16.8x faster tuning on CPU and 12.48x on GPU.
Reduces data collection costs by selecting only 10% of tensor programs.
Improves inference latency by 1.20x on CPU and 1.13x on GPU.
Abstract
Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable…
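To make the idea behind the RDU Sampler concrete, here is a minimal sketch of a greedy selector that combines representativeness, diversity, and uncertainty into one score and keeps 10% of a pool. The abstract does not give the actual formulation, so everything here is an assumption made for illustration: the function name `rdu_select`, the three scoring terms (negative mean distance to the pool, distance to the nearest already-selected item, and a caller-supplied uncertainty value), the min-max normalization, and the equal weights.

```python
import math
import random


def _minmax(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rdu_select(features, uncertainty, fraction=0.10, weights=(1.0, 1.0, 1.0)):
    """Greedily pick indices that score high on three hypothetical criteria.

    This is an illustrative scoring rule, not the formulation used by TCL.
    """
    n = len(features)
    budget = max(1, round(fraction * n))
    w_rep, w_div, w_unc = weights

    # Representativeness (assumed): points close to the rest of the pool score high.
    rep = _minmax([
        -sum(math.dist(features[i], features[j]) for j in range(n)) / n
        for i in range(n)
    ])
    # Uncertainty (assumed): supplied by the caller, e.g. cost-model disagreement.
    unc = _minmax(list(uncertainty))

    selected = []
    remaining = set(range(n))
    nearest = [math.inf] * n  # distance to the closest already-selected point

    while len(selected) < budget:
        if selected:
            # Diversity (assumed): points far from the current selection score high.
            idx = sorted(remaining)
            div = dict(zip(idx, _minmax([nearest[i] for i in idx])))
        else:
            div = {i: 0.0 for i in remaining}

        # Ties are broken in favor of the lowest index to keep the result deterministic.
        best = max(
            remaining,
            key=lambda i: (w_rep * rep[i] + w_div * div[i] + w_unc * unc[i], -i),
        )
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            nearest[i] = min(nearest[i], math.dist(features[best], features[i]))
    return selected


# Usage on synthetic data: 200 candidate programs with 4-dimensional features.
random.seed(0)
pool = [[random.random() for _ in range(4)] for _ in range(200)]
scores = [random.random() for _ in range(200)]
chosen = rdu_select(pool, scores)  # indices of the 10% selected for measurement
```

The greedy loop recomputes the diversity term after each pick, so later selections are pushed away from programs that are already covered. The all-pairs distance computation is quadratic in the pool size, which is acceptable for a sketch but would need an approximation (for example, clustering first) on a realistically large pool.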
