Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

Ali Rasekh; Erfan Bagheri Soula; Omid Daliran; Simon Gottschalk; Mohsen Fayyaz

arXiv:2510.26027·cs.CV·October 31, 2025

TL;DR

This paper introduces stacked temporal attention modules within vision encoders of Video-LLMs, significantly improving their ability to understand complex temporal dynamics in videos and outperforming existing models on key benchmarks.

Contribution

The novel architecture integrates stacked temporal attention into vision encoders, enhancing temporal reasoning in Video-LLMs for better action sequence comprehension.

Findings

01

Improved performance on the VITATECS, MVBench, and Video-MME benchmarks by up to 5.5%.

02

Enhanced temporal reasoning in video question answering tasks.

03

Addresses a critical gap in video understanding for Video-LLMs.

Abstract

Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in…
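To make the idea of temporal attention inside a vision encoder concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the token layout (frames, then patches, then channels), the single attention head, the identity query/key/value projections, and the omission of layer normalization and feed-forward blocks are all simplifying assumptions made for illustration. The one property it is meant to show is that each patch position attends across frames only, and that several such layers can be stacked before the visual tokens would be handed to the LLM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(tokens):
    """One temporal self-attention layer with a residual connection.

    tokens[t][n] is the feature vector (list of floats) of patch n in frame t.
    Each patch position attends across frames only, never across patches.
    Assumption: identity Q/K/V projections stand in for learned weights.
    """
    num_frames = len(tokens)
    num_patches = len(tokens[0])
    dim = len(tokens[0][0])
    scale = 1.0 / math.sqrt(dim)
    out = [[None] * num_patches for _ in range(num_frames)]
    for n in range(num_patches):
        # The same patch position followed through time.
        track = [tokens[t][n] for t in range(num_frames)]
        for t, query in enumerate(track):
            scores = [scale * sum(q * k for q, k in zip(query, key)) for key in track]
            weights = softmax(scores)
            mixed = [sum(w * v[d] for w, v in zip(weights, track)) for d in range(dim)]
            # Residual connection: keep the original feature, add temporal context.
            out[t][n] = [x + m for x, m in zip(query, mixed)]
    return out


def stacked_temporal_attention(tokens, num_layers=2):
    """Apply several temporal attention layers in sequence (hypothetical depth)."""
    for _ in range(num_layers):
        tokens = temporal_attention(tokens)
    return tokens


# Toy video: 3 frames, 2 patches per frame, 2 channels per patch.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0]],
]
encoded = stacked_temporal_attention(video, num_layers=2)
```

In the toy input, the second patch is identical in every frame, so its output stays identical across frames, while the first patch changes over time and its features end up mixed with those of the other frames. A real encoder would use learned projections, multiple heads, and normalization, and would interleave these layers with spatial attention.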

Videos

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders · SlidesLive