TL;DR
LLaVA-ST is a multimodal large language model designed for detailed spatial-temporal understanding in videos, introducing new embedding and attention techniques along with a large dataset and benchmark for evaluation.
Contribution
It introduces novel spatial-temporal embedding and attention mechanisms, along with a large dataset and benchmark, to improve fine-grained spatial-temporal multimodal understanding.
Findings
Achieves state-of-the-art results on 11 benchmarks.
Introduces ST-Align dataset with 4.3 million samples.
Develops a progressive training pipeline for spatial-temporal alignment.
Abstract
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design…
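The abstract describes Language-Aligned Positional Embedding only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes one coordinate special token per time, height, and width index, a single linear projection from the text embedding space into the visual space, and additive combination with the video features. All names (`make_embedding_table`, `project`, `add_coordinate_embeddings`), the dimensions, and the random initialisation are hypothetical.

```python
import random


def make_embedding_table(num_tokens, dim, seed):
    """Stand-in for the text embeddings of coordinate special tokens (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def project(vec, weight):
    """Linear map from text space to visual space; weight has one row per output dim."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def add_coordinate_embeddings(video_feats, t_table, h_table, w_table, proj):
    """Add projected coordinate-token embeddings to each feature.

    video_feats[t][h][w] is a feature vector in the visual space.
    """
    out = []
    for t, frame in enumerate(video_feats):
        pos_t = project(t_table[t], proj)
        new_frame = []
        for h, row in enumerate(frame):
            pos_h = project(h_table[h], proj)
            new_row = []
            for w, feat in enumerate(row):
                pos_w = project(w_table[w], proj)
                new_row.append(
                    [f + a + b + c for f, a, b, c in zip(feat, pos_t, pos_h, pos_w)]
                )
            new_frame.append(new_row)
        out.append(new_frame)
    return out


# Toy sizes, chosen only for illustration.
T, H, W = 2, 2, 3
TEXT_DIM, VIS_DIM = 4, 3

t_table = make_embedding_table(T, TEXT_DIM, seed=0)
h_table = make_embedding_table(H, TEXT_DIM, seed=1)
w_table = make_embedding_table(W, TEXT_DIM, seed=2)
proj = make_embedding_table(VIS_DIM, TEXT_DIM, seed=3)  # VIS_DIM x TEXT_DIM weight

# Zero features make the added positional component easy to inspect.
video_feats = [[[[0.0] * VIS_DIM for _ in range(W)] for _ in range(H)] for _ in range(T)]
aligned = add_coordinate_embeddings(video_feats, t_table, h_table, w_table, proj)
```

Because the positional signal in this sketch is derived from the same token embeddings the language model would use when it writes coordinates as text, the visual and textual coordinate representations share one source, which is the alignment property the abstract points to.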
Taxonomy
Methods: Softmax · Attention Is All You Need
