TL;DR
DanmakuTPPBench is a multi-modal benchmark, comprising datasets and evaluation protocols, for temporal point process (TPP) modeling over temporal, textual, and visual data. It targets the limitation that existing TPP datasets are predominantly unimodal.
Contribution
It provides the first comprehensive multi-modal TPP benchmark, consisting of an event dataset derived from real-world video comments (DanmakuTPP-Events) and a question-answering dataset for complex reasoning tasks built with a multi-agent pipeline (DanmakuTPP-QA).
Findings
Current TPP models struggle with multi-modal data.
Large Language Models show potential but have limitations in multi-modal TPP tasks.
The benchmark reveals significant gaps in the capabilities of existing methods.
Abstract
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs),…
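To make the event structure described in the abstract concrete, here is a minimal sketch of how one multi-modal event (timestamp, comment text, associated video frame) might be represented, along with the inter-event times that TPP models typically consume. The class name `DanmakuEvent`, its field names, the frame paths, and the helper `inter_event_times` are all illustrative assumptions; they are not the dataset's actual schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DanmakuEvent:
    """One multi-modal event. All field names are hypothetical, not the dataset schema."""
    time_s: float                     # offset into the video, in seconds
    text: str                         # the bullet comment (Danmaku) text
    frame_path: Optional[str] = None  # assumed path to the corresponding video frame


def inter_event_times(events: List[DanmakuEvent]) -> List[float]:
    """Return the gaps between consecutive events after sorting by timestamp."""
    ordered = sorted(events, key=lambda e: e.time_s)
    return [b.time_s - a.time_s for a, b in zip(ordered, ordered[1:])]


# Toy data for illustration only; not taken from DanmakuTPP-Events.
events = [
    DanmakuEvent(12.5, "first!", "frames/000012.jpg"),
    DanmakuEvent(3.0, "here we go", "frames/000003.jpg"),
    DanmakuEvent(14.0, "lol"),
]
gaps = inter_event_times(events)
print(gaps)
```

Keeping the text and frame reference on the same record as the timestamp is one simple way to let a model condition on all three modalities per event; the actual benchmark may organize its data differently.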
