Long-range Modeling and Processing of Multimodal Event Sequences

Jichu Li; Yilun Zhong; Zhiting Li; Feng Zhou; Quyu Kong

arXiv:2602.01125 · cs.CL · February 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a novel multimodal temporal point process framework that extends language models to incorporate visual data, enabling long-range understanding and generation of rich, multimodal event sequences.

Contribution

It presents a new approach combining sequence compression and a two-stage training paradigm to handle long multimodal sequences effectively.

Findings

01. Outperforms state-of-the-art methods in predictive accuracy

02. Generates high-quality textual analyses of event sequences

03. Effectively models long-range dependencies in multimodal data

Abstract

Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential…
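The abstract is cut off before it describes the compression mechanism, and Reviewer 03 notes below that the paper does not appear to state how temporal similarity is computed. The following is therefore only a minimal sketch of the general idea, not the authors' method. The `<|similar_event|>` token name is taken from Reviewer 02's comments; the `Event` class, the `compress_sequence` function, the `rel_tol` parameter, and the similarity rule itself (same event type and a nearly unchanged inter-event gap) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Token name as quoted in the reviews; everything else here is hypothetical.
SIMILAR_EVENT_TOKEN = "<|similar_event|>"


@dataclass
class Event:
    time: float    # event timestamp
    type_id: int   # event type (mark)
    payload: str   # stand-in for the event's text or image tokens


def compress_sequence(events: List[Event], rel_tol: float = 0.1) -> List[str]:
    """Replace temporally similar events with a single placeholder token.

    ASSUMED rule (not from the paper): an event is "similar" to its
    predecessor if it has the same type and its inter-event gap differs
    from the previous gap by at most rel_tol (relative).
    """
    out: List[str] = []
    prev_time: Optional[float] = None
    prev_gap: Optional[float] = None
    prev_type: Optional[int] = None
    for ev in events:
        gap = None if prev_time is None else ev.time - prev_time
        similar = (
            gap is not None
            and prev_gap is not None
            and ev.type_id == prev_type
            and abs(gap - prev_gap) <= rel_tol * max(prev_gap, 1e-9)
        )
        if similar:
            # One token instead of the full (possibly multimodal) payload.
            out.append(SIMILAR_EVENT_TOKEN)
        else:
            out.append(f"<t={ev.time:.1f}|k={ev.type_id}> {ev.payload}")
        prev_time, prev_gap, prev_type = ev.time, gap, ev.type_id
    return out


events = [
    Event(0.0, 1, "a"),
    Event(10.0, 1, "b"),
    Event(20.0, 1, "c"),
    Event(30.0, 1, "d"),
    Event(100.0, 2, "e"),
]
compressed = compress_sequence(events)
```

Under this assumed rule, the third and fourth events repeat the 10-unit gap and the type of their predecessors, so each collapses to one placeholder token, while the first two events and the final event keep their full content. Collapsing payloads in this way is what would let more events fit into a fixed context window.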

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

- The extensive documentation of the experimental setup is appreciated.
- The experiments are extensive. The authors compare against a range of TPP approaches and select fair experimental details for an appropriate comparison. The selected metrics seem appropriate.
- Standard deviations for the experimental results are presented in Table 1 alongside key metrics.
- The paper is well written and very detailed. I do not have many questions after reading through the experiments section.
- The abl

Weaknesses

- Given that only two benchmarks were selected for evaluation, it would be helpful to add a sentence or two explaining why these datasets were selected and why more are not needed for a fair evaluation.
- Figure 1 is difficult to read due to the font choices and text size.
- It would be interesting to consider other backbone architectures beyond Qwen-2.5 and its different sizes.

Reviewer 02 · Rating 6 · Confidence 1

Strengths

1. **First multimodal TPP dataset.** The paper proposes TAXI-PRO, which may be a useful testbed for future work.
2. **Simplicity and effectiveness.** The `<|similar_event|>` token is easy to implement and effectively increases the number of events that fit in a single context window.

Weaknesses

1. **Evaluation breadth.** DanmakuTPP is the only existing benchmark used for evaluation, which could bias the results.
2. **Lack of efficient MLLM baselines.** The authors should include token pruning methods (such as [1]) as baselines, since they are closely related to the topic addressed.

[1] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (ECCV 2024)

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The paper is well organized.
2. Using LLMs for multimodal event sequences is a topic worth exploring in the research community, and this paper targets this important problem.
3. The paper provides code, which helps reviewers better understand the proposed method.

Weaknesses

There are some concerns and questions about this paper:

1. In Section 4.3, the authors mention that temporal similarity between events can be calculated to reduce sequence length. However, how is this similarity calculated? The authors do not seem to mention the calculation method.
2. In video understanding, I learned that only 64 frames are needed to help the model understand what is happening in the video. So, for the video input mentioned in the paper, how many video frames are actually inp

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Emotion and Mood Recognition