InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Ashutosh Kumar; Rajat Saini; Jingjing Pan; Mustafa Erdogan; Mingfang Zhang; Betty Le Dem; Norimasa Kobori; Quan Kong

arXiv:2604.08337 · cs.CV · April 10, 2026

TL;DR

InstAP is an instance-aware vision-language pre-training framework that improves spatial-temporal understanding by grounding textual mentions to specific spatial-temporal regions. It outperforms existing VLP models on instance-level retrieval and achieves competitive zero-shot performance on video benchmarks.

Contribution

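The paper proposes InstAP, a pre-training method that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment. It also presents InstVL, a large-scale dataset (2 million images, 50,000 videos) annotated with both holistic scene captions and dense, grounded instance descriptions.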

Findings

01. Outperforms existing VLP models on instance-level retrieval tasks.

02. Achieves competitive zero-shot performance on multiple video benchmarks.

03. Effectively localizes textual mentions to correct spatial-temporal instances.

Abstract

Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global…
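
The abstract describes a joint objective with a global vision-text term and an instance-level contrastive term, but this page does not give the loss itself. The sketch below is one plausible reading, not the authors' formulation: it assumes a symmetric InfoNCE loss at both granularities, summed with a weight. The function names, the `temperature` value, and the `instance_weight` parameter are all hypothetical, and the embeddings are assumed to be precomputed, non-zero vectors.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i].

    Assumption: a standard InfoNCE form; the paper's exact loss is not
    stated on this page.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, query in enumerate(q):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in k]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(z - peak)
                                              for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive(visual, text, temperature=0.07):
    """Average of the visual-to-text and text-to-visual InfoNCE terms."""
    return 0.5 * (info_nce(visual, text, temperature)
                  + info_nce(text, visual, temperature))


def joint_loss(scene_visual, scene_text, instance_visual, instance_text,
               instance_weight=1.0):
    """Hypothetical joint objective: global term plus weighted instance term.

    scene_* pair whole images or clips with scene captions; instance_* pair
    region (or spatial-temporal tube) embeddings with the text that mentions
    them. How regions are pooled into embeddings is outside this sketch.
    """
    global_term = symmetric_contrastive(scene_visual, scene_text)
    instance_term = symmetric_contrastive(instance_visual, instance_text)
    return global_term + instance_weight * instance_term
```

Under these assumptions, correctly paired embeddings yield a loss near zero, while mismatched instance pairs raise only the instance term. That is the kind of signal global-only supervision cannot provide.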

Code & Models

Datasets

wovenbytoyota-vai/InstVL
dataset · 216 downloads
