# An AI-Driven Multimodal Sensor Fusion Framework for Fraud Perception in Short-Video and Live-Streaming Platforms

**Authors:** Ruixiang Zhao, Xuanhao Zhang, Jinfan Yang, Haofei Li, Zhengjia Lu, Wenrui Xu, Manzhou Li

PMC · DOI: 10.3390/s26051525 · Sensors (Basel, Switzerland) · 2026-02-28

## TL;DR

This paper introduces an AI framework that detects fraud in short-video and live-streaming platforms by analyzing multimodal sensor data over time.

## Contribution

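The paper proposes an AI-driven multimodal sensor fusion framework for temporal fraud detection on short-video platforms. It combines four components: a multimodal temporal alignment module, a shared temporal encoding network, a cross-modal temporal attention fusion mechanism, and a fraud evolution modeling and early risk prediction module.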

## Key findings

- The framework achieves an accuracy of 0.941 and an AUC of 0.956 on real-world datasets collected from multiple mainstream short-video platforms.
- In early-stage detection using only the first 30% of video content, it still reaches a precision of 0.812, recall of 0.704, and F1 score of 0.754.
- The model outperforms text-based, vision-based, temporal, and conventional multimodal baselines.
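
The abstract reports precision, recall, and F1 for both the full-video and the first-30% settings. Because F1 is the harmonic mean of precision and recall, the reported triples can be checked for internal consistency. The snippet below is a minimal sketch of that check; the helper name `f1_from_pr` is illustrative and does not come from the paper.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Return the F1 score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values as reported in the abstract.
full_f1 = f1_from_pr(0.865, 0.812)   # full-video setting
early_f1 = f1_from_pr(0.812, 0.704)  # first 30% of video content

print(round(full_f1, 3), round(early_f1, 3))  # prints: 0.838 0.754
```

Both values match the F1 scores stated in the abstract to three decimal places.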

## Abstract

With the rapid proliferation of short-video platforms and live-streaming commerce ecosystems, marketing activities are increasingly manifested through complex multimodal sensing signals. These heterogeneous sensor data streams exhibit strong temporal dependency, high cross-modal coupling, and progressive evolutionary characteristics, making early-stage fraud perception particularly challenging for conventional unimodal or static analytical paradigms. Existing approaches often fail to effectively capture weak anomalous cues emerging across multimodal channels during the initial stages of fraudulent campaigns. To address these limitations, an artificial intelligence-driven multimodal sensor perception framework is proposed for temporal fraud detection in short-video environments. A multimodal temporal alignment module is first designed to synchronize heterogeneous sensor signals with inconsistent sampling granularities. Subsequently, a shared temporal encoding network is constructed to learn evolution-aware representations across multimodal sensor sequences. On this basis, a cross-modal temporal attention fusion mechanism is introduced to dynamically weight sensor contributions at different behavioral stages. Finally, a fraud evolution modeling and early risk prediction module is developed to characterize the progressive intensification of fraudulent activities and to enable risk assessment under incomplete temporal observations. Extensive experiments conducted on real-world datasets collected from multiple mainstream short-video platforms demonstrate the effectiveness of the proposed AI-driven sensing framework. The model achieves an overall accuracy of 0.941, precision of 0.865, recall of 0.812, and F1 score of 0.838, with the AUC further reaching 0.956, significantly outperforming text-based, vision-based, temporal, and conventional multimodal baselines. 
In early-stage detection scenarios utilizing only the first 30% of video content, the framework maintains stable performance advantages, achieving a precision of 0.812, recall of 0.704, and F1 score of 0.754, validating its capability for proactive fraud warning.
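
To make the pipeline described above more concrete, the following NumPy sketch illustrates the general idea of two of its stages: resampling modality streams with different sampling granularities onto a common time grid, and fusing them with per-time-step attention weights over modalities. It is not the authors' implementation. The modality names, feature dimension, grid length, linear interpolation, and the single random query vector are all assumptions chosen for illustration, and the shared temporal encoder is omitted entirely.

```python
import numpy as np


def align_to_grid(seq: np.ndarray, num_steps: int) -> np.ndarray:
    """Linearly resample a (T_m, D) sequence onto num_steps evenly spaced points."""
    src = np.linspace(0.0, 1.0, seq.shape[0])   # normalized source timestamps
    dst = np.linspace(0.0, 1.0, num_steps)      # normalized target timestamps
    cols = [np.interp(dst, src, seq[:, d]) for d in range(seq.shape[1])]
    return np.stack(cols, axis=1)               # shape (num_steps, D)


def attention_fuse(aligned: np.ndarray, query: np.ndarray):
    """Fuse modalities with softmax weights computed per time step.

    aligned: (M, T, D) aligned modality features; query: (D,) scoring vector.
    Returns the fused sequence (T, D) and the modality weights (T, M).
    """
    scores = np.einsum("mtd,d->tm", aligned, query) / np.sqrt(aligned.shape[-1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    fused = np.einsum("tm,mtd->td", weights, aligned)
    return fused, weights


# Hypothetical streams with inconsistent sampling granularities.
rng = np.random.default_rng(0)
D = 8
text_feats = rng.normal(size=(12, D))    # e.g. caption or comment features
video_feats = rng.normal(size=(90, D))   # e.g. frame-level features
audio_feats = rng.normal(size=(45, D))   # e.g. audio-window features

T = 30  # assumed length of the common time grid
aligned = np.stack(
    [align_to_grid(x, T) for x in (text_feats, video_feats, audio_feats)]
)
query = rng.normal(size=D)  # stands in for a learned parameter
fused, weights = attention_fuse(aligned, query)

# Early-stage observation: keep only the first 30% of the aligned steps.
early_steps = max(1, (T * 30) // 100)
early_fused = fused[:early_steps]

print(fused.shape, weights.shape, early_fused.shape)
```

In a trained model the query and the per-modality features would be learned, and the weights would typically depend on temporal context rather than on a single fixed vector. The sketch only shows how time-varying modality weights and truncated observations fit together.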

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12987105/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12987105/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/PMC12987105/full.md

---
Source: https://tomesphere.com/paper/PMC12987105