Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

Jiafeng Liang; Shixin Jiang; Xuan Dong; Ning Wang; Zheng Chu; Hui Su; Jinlan Fu; Ming Liu; See-Kiong Ng; Bing Qin

arXiv:2505.14405·cs.CV·May 21, 2025

TL;DR

This paper assesses the temporal robustness of large multimodal models in video understanding, introduces a benchmark for testing their resilience to temporal inconsistencies, and proposes a method to improve their temporal analysis capabilities.

Contribution

The paper introduces TemRobBench, a benchmark for evaluating the temporal robustness of LMMs, and PanoDPO (panoramic direct preference optimization), a training method intended to make their temporal analysis more robust.
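
For readers unfamiliar with direct preference optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not PanoDPO: this summary does not give PanoDPO's objective, so the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration only.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Illustrative only: this is plain DPO, not the paper's PanoDPO.
    Each argument is the summed log-probability of a response under
    the policy being trained or under the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the policy favors the chosen response more strongly than the reference model does, and rises above it in the opposite case.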

Findings

01 The 16 evaluated LMMs over-rely on prior knowledge and textual context

02 Under adversarial perturbations, the models ignore the actual temporal dynamics in videos

03 PanoDPO improves robustness and reliability

Abstract

Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs thorough investigation, yet it has been largely ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately in the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they over-rely on prior knowledge and textual context in adversarial environments while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling