Multimodal Banking Dataset: Understanding Client Needs through Event Sequences

Dzhambulat Mollaev; Alexander Kostin; Maria Postnova; Ivan Karpukhin; Ivan Kireev; Gleb Gusev; Andrey Savchenko

arXiv:2409.17587·cs.LG·June 3, 2025

TL;DR

This paper introduces MBD, a large-scale, anonymized multimodal banking dataset with over 2 million clients, enabling advanced research in event sequence modeling for financial applications.

Contribution

The paper presents the first open-source, industrial-scale multimodal banking dataset and a benchmark for practical tasks like purchase prediction and modality matching.

Findings

01 Fusion models outperform single-modal techniques.

02 State-of-the-art event sequence models perform well on downstream tasks.

03 Anonymization preserves essential information for analysis.

Abstract

Financial organizations collect a huge amount of temporal (sequential) data about clients, which is typically collected from multiple sources (modalities). Despite the urgent practical need, developing deep learning techniques suitable to handle such data is limited by the absence of large open-source multi-source real-world datasets of event sequences. To fill this gap, which is mainly caused by security reasons, we present the first industrial-scale publicly available multimodal banking dataset, MBD, that contains information on more than 2M corporate clients of a large bank. Clients are represented by several data sources: 950M bank transactions, 1B geo position events, 5M embeddings of dialogues with technical support, and monthly aggregated purchases of four bank products. All entries are properly anonymized from real proprietary bank data, and the experiments confirm that our…
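As a rough illustration of the multi-source structure the abstract describes, the sketch below holds one client's per-modality event sequences and interleaves them into a single time-ordered stream. Everything here is an assumption made for illustration: the `ClientRecord` class, its field names, the integer timestamps, and the merge step itself are hypothetical and are not taken from the MBD schema or from the authors' code.

```python
from dataclasses import dataclass, field
from heapq import merge
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (timestamp, modality, payload). Not the MBD schema.
Event = Tuple[int, str, object]


@dataclass
class ClientRecord:
    """One client's data, split by source (all field names are assumptions)."""
    client_id: str
    transactions: List[Tuple[int, float]] = field(default_factory=list)    # (timestamp, amount)
    geo_events: List[Tuple[int, str]] = field(default_factory=list)        # (timestamp, anonymized cell)
    dialogues: List[Tuple[int, List[float]]] = field(default_factory=list)  # (timestamp, embedding)


def tag(events: Iterable[Tuple[int, object]], modality: str) -> Iterator[Event]:
    """Attach a modality label to each (timestamp, payload) pair."""
    for timestamp, payload in events:
        yield (timestamp, modality, payload)


def merged_timeline(client: ClientRecord) -> List[Event]:
    """Interleave the per-modality sequences by timestamp.

    Each input sequence must already be sorted by timestamp; heapq.merge
    then produces one sorted stream without re-sorting everything.
    """
    streams = [
        tag(client.transactions, "trx"),
        tag(client.geo_events, "geo"),
        tag(client.dialogues, "dialog"),
    ]
    return list(merge(*streams, key=lambda event: event[0]))


# Toy data, invented for this example only.
client = ClientRecord(
    client_id="c1",
    transactions=[(1, 9.5), (4, 20.0)],
    geo_events=[(2, "g17")],
    dialogues=[(3, [0.1, 0.2])],
)
timeline = merged_timeline(client)
modalities = [modality for _, modality, _ in timeline]
```

A single merged stream is only one possible design. Keeping the modalities separate and combining their learned representations later is an equally plausible choice, and nothing in this sketch indicates which one the benchmark uses.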

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The work releases the first large-scale banking dataset that is publicly available for financial applications.
2. The authors present a good benchmark comparing unimodal and multimodal methods across various predictive tasks. The experimental protocol and metrics are also clearly laid out.

Weaknesses

1. The authors do not explore or discuss advanced multimodal sequence models or advanced fusion techniques such as cross-attention mechanisms, which could better capture interactions across modalities. They mention these only at the end, as scope for future work.
2. Although the authors discuss using ROC AUC as their metric to mitigate label imbalance, for example in their campaigning downstream task, they do not discuss or incorporate any additional techniques for handling the imbalance.
3. Details a

Reviewer 02 · Rating 8 · Confidence 4

Strengths

(1) This will be the first and the largest multimodal banking dataset to be released, and it could be tremendously useful to the research community.
(2) The baseline methods and benchmark data for the few problems outlined will also be immensely useful to the research community.

Weaknesses

(1) The details of the data are somewhat sparse. More detail on each type of data would be useful to the reader. This article may help improve that aspect of the paper's exposition: https://cacm.acm.org/research/datasheets-for-datasets/
(2) Can some details of the anonymization be provided without compromising the privacy of the customers? That would help estimate the errors of any model developed using this data.
(3) The data has been collected during the pandemic

Reviewer 03 · Rating 3 · Confidence 5

Strengths

1. A large-scale multimodal banking dataset, MBD, is provided. This dataset contains anonymized banking transactions, geographic locations, and technical support dialogues, which contributes to the development of large-scale sequential event tasks in the future.
2. The dataset addresses privacy concerns through effective data anonymization, ensuring that the algorithm's performance is not significantly compromised.
3. The dataset and experimental code are publicly available, promoting transparency

Weaknesses

1. The dataset's multimodal data includes banking transaction records, geographic locations, dialogue embeddings, and banking product purchase history. However, it appears that many of these modalities are essentially text-based. This differs from typical multimodal datasets, which include modalities such as video, audio, and text.
2. The main contribution of the paper lies in the introduction of a large-scale dataset, but it lacks innovative methods for addressing related tasks. Additionally, t

Code & Models

Repositories

dzhambo/mbd · PyTorch · Official


Taxonomy

Topics: Customer churn and segmentation · Business Process Modeling and Analysis · Big Data and Business Intelligence