EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction
Qile Su, Shoutai Zhu, Shuai Zhang, Baoyu Liang, Chao Tong

TL;DR
EventFormer introduces a hierarchical attention transformer for action-centric video event prediction, leveraging a new structured dataset with rich annotations to improve understanding of complex video events.
Contribution
The paper presents EventFormer, a novel node-graph hierarchical attention model, and introduces AVEP, a large dataset with fine-grained multimodal annotations for video event prediction.
Findings
EventFormer outperforms state-of-the-art video prediction models.
The AVEP dataset enables better structured representation of video events.
Traditional visual models struggle with complex event structures.
Abstract
Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented as videos rather than scripts, and related research in the vision domain is lacking. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about annotated videos and more than video clips of event, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing…
