Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring

Xia Li; Hanghang Zheng; Xiwei Zhuang; Zhong Wang; Xiao Chen; Hong Liu; Jasmine Bai; Mao Mao

arXiv:2501.10677·cs.LG·March 31, 2026


TL;DR

This paper introduces a novel framework combining dataset distillation and pretrained models to improve credit scoring on tabular data, addressing class imbalance issues for better scalability and performance.

Contribution

It proposes a new method that integrates class imbalance-aware dataset distillation with pretrained models, enhancing scalability and effectiveness in financial tabular data analysis.
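To make "class imbalance-aware" concrete, the sketch below shows one simple way a fixed distilled-set budget could be split across classes so that the minority (default) class is not crowded out. It is an illustration only and is not taken from the paper: the function name `allocate_budget`, the `min_per_class` floor, and the budget of 1,000 rows are all hypothetical choices.

```python
from typing import Dict


def allocate_budget(
    class_counts: Dict[int, int],
    budget: int,
    min_per_class: int,
) -> Dict[int, int]:
    """Split a distilled-set budget across classes.

    Hypothetical scheme (an assumption, not the paper's method): every class
    first receives `min_per_class` slots, and the remaining slots are shared
    in proportion to the original class frequencies.
    """
    if not class_counts:
        raise ValueError("class_counts must not be empty")
    reserved = min_per_class * len(class_counts)
    if reserved > budget:
        raise ValueError("budget is too small for the per-class minimum")

    total = sum(class_counts.values())
    remaining = budget - reserved

    # Guaranteed floor plus a proportional share (integer division rounds down).
    alloc = {
        label: min_per_class + (remaining * count) // total
        for label, count in class_counts.items()
    }

    # Integer division can leave a few slots unassigned; give them to the
    # most frequent class so the allocation sums exactly to the budget.
    largest = max(class_counts, key=class_counts.get)
    alloc[largest] += budget - sum(alloc.values())
    return alloc


# Example: 5% of the original rows belong to class 1 (defaults).
counts = {0: 9500, 1: 500}
print(allocate_budget(counts, budget=1000, min_per_class=200))
```

Under these assumed settings the minority class receives 230 of the 1,000 slots, compared with the 50 it would get under purely proportional allocation. Because distilled samples are synthesized rather than selected, a class's allocation may exceed its original row count.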

Findings

01

Achieved a 2.5% improvement in AUC on financial datasets.

02

Demonstrated the effectiveness of imbalance-aware techniques in dataset distillation.

03

Enabled large pretrained models to better handle tabular credit scoring data.

Abstract

The advent of artificial intelligence has significantly enhanced credit scoring technologies. Despite the remarkable efficacy of advanced deep learning models, mainstream adoption continues to favor tree-structured models because of their robust predictive performance on tabular data. Although pretrained models have seen considerable development, their application in finance predominantly revolves around question-answering tasks, and their use on tabular credit scoring datasets remains largely unexplored. Tabular-oriented large models such as TabPFN have made the application of large models to credit scoring feasible, although they can only process limited sample sizes. This paper proposes a novel framework that combines a dataset distillation technique tailored to tabular data with the pretrained model, enabling TabPFN to scale. Furthermore, though…
