PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Xiao Yu; Yan Fang; Xiaojie Jin; Yao Zhao; Yunchao Wei

arXiv:2505.23155 · cs.CV · October 24, 2025

TL;DR

PreFM is a framework for real-time online audio-visual event parsing. It predicts future audio-visual cues to compensate for the limited context available in a video stream, and it outperforms existing methods while using fewer parameters.

Contribution

The paper proposes PreFM, a novel predictive future modeling approach enabling accurate, efficient online audio-visual event parsing with modality-agnostic representations.

Findings

01. PreFM outperforms state-of-the-art methods on the UnAV-100 and LLP datasets.

02. PreFM achieves high accuracy with significantly fewer parameters.

03. PreFM demonstrates real-time processing capabilities for multimodal video understanding.

Abstract

Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featured by (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual…
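
The online setting described in the abstract can be illustrated with a minimal sketch. This is not the PreFM model: it only shows the streaming protocol, in which a prediction is emitted for each incoming segment using the current segment and a bounded window of past segments, never future ones. The names `parse_online`, `toy_score_fn`, and `context_len`, the `Segment` type, and the thresholding scorer are all hypothetical placeholders introduced for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence, Tuple

# Hypothetical type: one (audio_features, visual_features) pair per segment.
Segment = Tuple[Sequence[float], Sequence[float]]


def parse_online(
    stream: Iterable[Segment],
    score_fn: Callable[[List[Segment]], List[str]],
    context_len: int = 4,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (segment_index, predicted_events) as each segment arrives.

    score_fn sees only the current segment and up to context_len - 1
    past segments; future segments are never accessed.
    """
    context: Deque[Segment] = deque(maxlen=context_len)
    for t, segment in enumerate(stream):
        context.append(segment)  # the oldest segment is dropped automatically
        yield t, score_fn(list(context))


def toy_score_fn(context: List[Segment]) -> List[str]:
    """Placeholder scorer (an assumption, not PreFM): thresholds mean activation."""
    audio_mean = sum(sum(a) / len(a) for a, _ in context) / len(context)
    visual_mean = sum(sum(v) / len(v) for _, v in context) / len(context)
    events = []
    if audio_mean > 0.5:
        events.append("audio_event")
    if visual_mean > 0.5:
        events.append("visual_event")
    if audio_mean > 0.5 and visual_mean > 0.5:
        events.append("audio_visual_event")
    return events


# Toy stream of three segments with made-up feature values.
stream = [
    ([0.9, 0.8], [0.1, 0.2]),
    ([0.9, 0.9], [1.0, 1.0]),
    ([0.1, 0.0], [0.9, 1.0]),
]
results = list(parse_online(stream, toy_score_fn, context_len=2))
for t, events in results:
    print(t, events)
```

Because the context buffer is bounded, per-segment cost stays constant as the stream grows, which is the kind of efficiency constraint the online task imposes. A real model would replace `toy_score_fn` with a learned network.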

Code & Models

Repositories

xiaoyu-1123/prefm (official)

Datasets

Yang1213112131/PreFM
dataset · 28 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning