FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen; Jinlan Fu; Changsong Li; See-Kiong Ng; Xipeng Qiu

arXiv:2601.13836·cs.CL·January 21, 2026

TL;DR

FutureOmni introduces a novel benchmark for evaluating the ability of multimodal large language models to predict future events from audio-visual data, highlighting current limitations and proposing training strategies to improve performance.

Contribution

The paper presents the first benchmark for omni-modal future forecasting, along with a new training method that enhances models' predictive capabilities in multimodal contexts.

Findings

01 Current models struggle with audio-visual future prediction, especially in speech-heavy scenarios.

02 The proposed OFF training strategy improves forecasting accuracy and generalization.

03 FutureOmni provides a comprehensive dataset for future research in multimodal forecasting.

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios,…
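
Since the benchmark consists of multiple-choice QA pairs grouped into domains, one natural way to score a model is accuracy per domain plus overall accuracy. The following is a minimal sketch of that computation, not the authors' evaluation code. The record keys `domain`, `answer`, and `prediction`, as well as the function name `accuracy_by_domain`, are assumptions made for illustration; the actual dataset schema may differ.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute per-domain and overall multiple-choice accuracy.

    `records` is an iterable of dicts with hypothetical keys:
      'domain'     - domain label of the question
      'answer'     - ground-truth option letter, e.g. "A"
      'prediction' - option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["domain"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["domain"]] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_domain, overall


# Toy records with made-up domain names, for illustration only.
sample = [
    {"domain": "domain_a", "answer": "A", "prediction": "a "},
    {"domain": "domain_a", "answer": "B", "prediction": "C"},
    {"domain": "domain_b", "answer": "D", "prediction": "D"},
]
per_domain, overall = accuracy_by_domain(sample)
print(per_domain, overall)
```

Reporting per-domain scores alongside the overall figure makes it easier to see where a model falls short, such as the speech-heavy scenarios the abstract mentions.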

Code & Models

Datasets

OpenMOSS-Team/FutureOmni
dataset · 184 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Multisensory perception and integration