Categorising SME Bank Transactions with Machine Learning and Synthetic Data Generation

Aluffi Pietro Alessandro; Brandi Jess; Marya Bazzi; Kate Kennedy; Matt Arderne; Daniel Rodrigues; Martin Lotz

arXiv:2508.05425 · cs.CE · August 8, 2025

TL;DR

This paper presents a machine learning pipeline that uses synthetic data generation to improve categorization of SME bank transactions, addressing challenges of data scarcity and imbalance for better lending decisions.

Contribution

It introduces a novel synthetic data generation method combined with a calibration approach to enhance transaction classification accuracy in SME lending.

Findings

01. Achieved 73.49% accuracy on real data
02. High-confidence predictions reach 90.36% accuracy
03. Model generalizes well across different SME types
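
The gap between the overall and high-confidence figures can be illustrated with a minimal sketch of selective accuracy. This listing does not say how the paper defines confidence, so the sketch assumes it is the top predicted class probability. The function name `selective_accuracy`, the threshold values, and the toy data are all hypothetical and are not taken from the paper.

```python
def selective_accuracy(probs, labels, threshold):
    """Return (accuracy, coverage) over predictions whose top
    class probability is at least `threshold`.

    Assumption: "confidence" means the maximum predicted probability.
    """
    kept = 0
    correct = 0
    for p, y in zip(probs, labels):
        # Predicted class is the index of the largest probability.
        pred = max(range(len(p)), key=p.__getitem__)
        if p[pred] >= threshold:
            kept += 1
            correct += int(pred == y)
    accuracy = correct / kept if kept else float("nan")
    coverage = kept / len(labels) if labels else 0.0
    return accuracy, coverage


# Toy, made-up probabilities for a two-class problem (not paper data).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
labels = [0, 1, 1, 0]

overall = selective_accuracy(probs, labels, 0.0)
confident = selective_accuracy(probs, labels, 0.8)
```

Raising the threshold usually trades coverage for accuracy: fewer transactions are labelled automatically, but those that are tend to be labelled more reliably. That trade-off is only meaningful when the predicted probabilities are reasonably well calibrated.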

Abstract

Despite their significant economic contributions, Small and Medium Enterprises (SMEs) face persistent barriers to securing traditional financing due to information asymmetries. Cash flow lending has emerged as a promising alternative, but its effectiveness depends on accurate modelling of transaction-level data. The main challenge in SME transaction analysis lies in the unstructured nature of textual descriptions, characterised by extreme abbreviations, limited context, and imbalanced label distributions. While consumer transaction descriptions often show significant commonalities across individuals, SME transaction descriptions are typically nonstandard and inconsistent across businesses and industries. To address some of these challenges, we propose a bank categorisation pipeline that leverages synthetic data generation to augment existing transaction data sets. Our approach comprises…
