LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, and Si Liu

arXiv:2501.08282 · cs.CV · June 3, 2025

TL;DR

LLaVA-ST is a multimodal large language model designed for detailed spatial-temporal understanding in videos, introducing new embedding and attention techniques along with a large dataset and benchmark for evaluation.

Contribution

It introduces novel spatial-temporal embedding and attention mechanisms, along with a large dataset and benchmark, to improve fine-grained spatial-temporal multimodal understanding.

Findings

01 Achieves state-of-the-art results on 11 benchmarks.
02 Introduces the ST-Align dataset with 4.3 million samples.
03 Develops a progressive training pipeline for spatial-temporal alignment.

Abstract

Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design…

Code & Models

Repositories

appletea233/llava-st (PyTorch, official)

Taxonomy

Methods: Softmax · Attention Is All You Need