Categorising SME Bank Transactions with Machine Learning and Synthetic Data Generation
Aluffi Pietro Alessandro, Brandi Jess, Marya Bazzi, Kate Kennedy, Matt Arderne, Daniel Rodrigues, Martin Lotz

TL;DR
This paper presents a machine learning pipeline that uses synthetic data generation to improve the categorisation of SME bank transactions, addressing data scarcity and label imbalance to support better lending decisions.
Contribution
It introduces a novel synthetic data generation method combined with a calibration approach to enhance transaction classification accuracy in SME lending.
Findings
Achieves 73.49% accuracy on real data
High-confidence predictions reach 90.36% accuracy
Model generalises well across different SME types
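The gap between overall accuracy and high-confidence accuracy can be illustrated with a minimal sketch. It assumes each prediction carries a calibrated confidence score and that "high-confidence" means a score at or above some threshold. The function name, the threshold of 0.9, the category labels, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def accuracy_at_confidence(
    confidences: Sequence[float],
    predicted: Sequence[str],
    actual: Sequence[str],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) over predictions with confidence >= threshold.

    Coverage is the fraction of all predictions that pass the threshold.
    Accuracy is computed only on the predictions that pass.
    """
    if not (len(confidences) == len(predicted) == len(actual)):
        raise ValueError("all inputs must have the same length")

    # Keep only the (predicted, actual) pairs whose confidence passes the threshold.
    kept = [(p, a) for c, p, a in zip(confidences, predicted, actual) if c >= threshold]
    if not kept:
        return 0.0, float("nan")

    coverage = len(kept) / len(confidences)
    accuracy = sum(p == a for p, a in kept) / len(kept)
    return coverage, accuracy


# Hypothetical toy data: five transactions with made-up categories and scores.
confidences = [0.95, 0.92, 0.60, 0.55, 0.91]
predicted = ["rent", "payroll", "tax", "rent", "utilities"]
actual = ["rent", "payroll", "rent", "tax", "utilities"]

overall = accuracy_at_confidence(confidences, predicted, actual, threshold=0.0)
high_conf = accuracy_at_confidence(confidences, predicted, actual, threshold=0.9)
```

In this toy data, restricting to the high-confidence subset raises accuracy at the cost of coverage: fewer transactions are categorised automatically, and the remainder would need another route such as manual review.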
Abstract
Despite their significant economic contributions, Small and Medium Enterprises (SMEs) face persistent barriers to securing traditional financing due to information asymmetries. Cash flow lending has emerged as a promising alternative, but its effectiveness depends on accurate modelling of transaction-level data. The main challenge in SME transaction analysis lies in the unstructured nature of textual descriptions, characterised by extreme abbreviations, limited context, and imbalanced label distributions. While consumer transaction descriptions often show significant commonalities across individuals, SME transaction descriptions are typically nonstandard and inconsistent across businesses and industries. To address some of these challenges, we propose a bank categorisation pipeline that leverages synthetic data generation to augment existing transaction data sets. Our approach comprises…
