EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection
En-Ya Kuo, Sebastien Motsch

TL;DR
EmDT is a novel diffusion-based Transformer model that generates synthetic fraudulent transaction data to improve fraud detection on imbalanced tabular datasets.
Contribution
The paper introduces EmDT, which uses UMAP clustering to identify distinct fraud patterns and a Transformer denoising network to generate synthetic fraudulent samples.
Findings
EmDT significantly improves classification performance over existing methods, including traditional oversampling and generative approaches.
Synthetic data preserves feature correlations and privacy.
Abstract
Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT…
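The abstract mentions sinusoidal positional embeddings in the Transformer denoising network but does not give their exact form. As a minimal sketch, assuming the standard sine/cosine formulation from the original Transformer architecture (which may differ from what EmDT actually uses), each feature column index can be mapped to a fixed vector as follows. The function name, the embedding width of 8, and the 5-column table are all hypothetical choices made for illustration.

```python
import math
from typing import List


def sinusoidal_embedding(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """Standard sinusoidal embedding of an integer position.

    Assumption: this is the classic Transformer formulation, not
    necessarily the exact variant used in EmDT.
    Even indices hold sin terms, odd indices hold cos terms.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        # Frequencies decay geometrically from 1 down toward 1/base.
        freq = 1.0 / (base ** (2 * i / dim))
        emb.append(math.sin(position * freq))
        emb.append(math.cos(position * freq))
    return emb


# Hypothetical example: one 8-dimensional embedding per column of a 5-feature table.
feature_embeddings = [sinusoidal_embedding(j, 8) for j in range(5)]
```

Because the vectors are deterministic functions of the column index, they add no trainable parameters, and every column receives a distinct tag that attention layers can use to tell features apart.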
