Big Data-Driven Fraud Detection Using Machine Learning and Real-Time Stream Processing

Chen Liu; Hengyu Tang; Zhixiao Yang; Ke Zhou; Sangwhan Cha

arXiv:2506.02008·cs.DC·June 4, 2025

TL;DR

This paper introduces a scalable system for real-time fraud detection in digital finance that combines Big Data streaming and distributed processing tools with machine learning, and reports over 99% classification accuracy on a synthetic Anti-Money Laundering (AML) dataset.

Contribution

It presents an integrated architecture that combines Apache Kafka, Flink, and Spark with cloud storage and machine learning models for real-time fraud detection; the claimed novelty lies in the scalability and efficiency of the combined system.
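To make the streaming side of such an architecture concrete, here is a minimal sketch of a per-account sliding-window feature step, the kind of stateful computation a stream processor such as Flink would typically run over a Kafka topic. It is not the authors' pipeline: the in-memory list stands in for the topic, and the field names (`account`, `ts`, `amount`), the one-hour window, and the feature names are all assumptions made for illustration.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # hypothetical window length, not taken from the paper


def windowed_features(transactions, window_seconds=WINDOW_SECONDS):
    """Yield (txn, features) for transactions arriving in event-time order.

    Features describe the account's earlier activity inside the window,
    excluding the current transaction itself.
    """
    recent = defaultdict(deque)  # account -> deque of (timestamp, amount)
    for txn in transactions:
        window = recent[txn["account"]]
        # Evict events that have fallen out of the sliding window.
        while window and window[0][0] <= txn["ts"] - window_seconds:
            window.popleft()
        features = {
            "count_in_window": len(window),
            "sum_in_window": sum(amount for _, amount in window),
            "amount": txn["amount"],
        }
        window.append((txn["ts"], txn["amount"]))
        yield txn, features


# A tiny in-memory stream standing in for a message topic (made-up values).
stream = [
    {"account": "A", "ts": 0, "amount": 100.0},
    {"account": "A", "ts": 600, "amount": 250.0},
    {"account": "B", "ts": 900, "amount": 40.0},
    {"account": "A", "ts": 4000, "amount": 75.0},
]
results = [features for _, features in windowed_features(stream)]
```

In a real deployment the per-account state would live in the stream processor's keyed state rather than a Python dictionary, and the resulting feature vector would be passed to a trained classifier for scoring.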

Findings

01  Over 99% classification accuracy achieved

02  Effective real-time detection of fraudulent transactions

03  Demonstrates scalability with Big Data tools

Abstract

In the age of digital finance, detecting fraudulent transactions and money laundering is critical for financial institutions. This paper presents a scalable and efficient solution using Big Data tools and machine learning models. We utilize real-time data streaming platforms like Apache Kafka and Flink, distributed processing frameworks such as Apache Spark, and the AWS cloud services S3 and RDS. A synthetic dataset representing real-world Anti-Money Laundering (AML) challenges is employed to build a binary classification model. Logistic Regression, Decision Tree, and Random Forest models are trained and evaluated using engineered features. Our system demonstrates over 99% classification accuracy, illustrating the power of combining Big Data architectures with machine learning to tackle fraud.
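For readers who want to reproduce the evaluation of a binary fraud classifier like the ones named in the abstract, the sketch below computes accuracy alongside precision, recall, and F1 from true and predicted labels. Fraud and AML datasets are usually heavily imbalanced, so these metrics are commonly reported together. The labels and the 1% fraud rate are made up for illustration and are not results from the paper; `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels only: 10 fraudulent and 990 legitimate transactions.
y_true = [1] * 10 + [0] * 990
# A hypothetical classifier that flags half of the fraud and nothing else.
y_pred = [1] * 5 + [0] * 5 + [0] * 990

metrics = binary_metrics(y_true, y_pred)
```

With these made-up labels the accuracy is 0.995 while recall is 0.5, which is why per-class metrics are worth reporting next to a headline accuracy figure.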

Taxonomy

Topics: Imbalanced Data Classification Techniques · Data Stream Mining Techniques