PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Xiao Yu, Yan Fang, Xiaojie Jin, Yao Zhao, Yunchao Wei

TL;DR
PreFM is a real-time framework for online audio-visual event parsing. It predicts future audio-visual cues to enrich the limited context available in a stream, and it outperforms existing methods with fewer parameters.
Contribution
The paper proposes PreFM, a novel predictive future modeling approach enabling accurate, efficient online audio-visual event parsing with modality-agnostic representations.
Findings
PreFM outperforms state-of-the-art methods on UnAV-100 and LLP datasets.
PreFM achieves high accuracy with significantly fewer parameters.
PreFM demonstrates real-time processing capabilities for multimodal video understanding.
Abstract
Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featured by (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual…
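To make the online setting concrete, here is a minimal sketch of causal, segment-by-segment inference over a stream. It is not the PreFM model. The names `parse_online` and `mean_score`, the window length, and the threshold are hypothetical and only illustrate the constraint that each prediction may use the current and past segments, never future ones.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence

Feature = Sequence[float]  # assumed: one per-class feature vector per segment


def parse_online(
    stream: Iterable[Feature],
    score_fn: Callable[[List[Feature]], List[float]],
    window: int = 4,
    threshold: float = 0.5,
) -> Iterator[List[int]]:
    """Yield the active event class indices for each incoming segment.

    Only the current segment and up to `window - 1` past segments are
    visible, so no future input is used (the online constraint).
    """
    context: Deque[Feature] = deque(maxlen=window)  # oldest segment drops out
    for feature in stream:
        context.append(feature)
        scores = score_fn(list(context))
        yield [i for i, s in enumerate(scores) if s >= threshold]


def mean_score(context: List[Feature]) -> List[float]:
    """Hypothetical stand-in scorer: per-class mean over the context window."""
    n = len(context)
    return [sum(f[i] for f in context) / n for i in range(len(context[0]))]


if __name__ == "__main__":
    segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    for step, events in enumerate(parse_online(segments, mean_score, window=2)):
        print(step, events)
```

A bounded `deque` keeps memory and per-step cost constant regardless of stream length, which is the efficiency side of the task; a real parser would replace `mean_score` with a learned model.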
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
