Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification

Ashwat Rajbhandari; Bharatesh Chakravarthi

arXiv:2604.04183 · cs.CV · April 7, 2026

TL;DR

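This paper adapts large-scale vision-language models to extreme far-distance video person re-identification, upgrading a CLIP baseline from ViT-B/16 to ViT-L/14 and using selective fine-tuning and temporal attention pooling to stay robust to low resolution, motion blur, and aerial-ground viewpoint mismatch.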

Contribution

It introduces a scale-aware adaptation framework for large vision-language models that combines a visual backbone upgrade (ViT-B/16 to ViT-L/14), backbone-aware selective fine-tuning, temporal attention pooling, and improved re-ranking.
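To make the idea of selective fine-tuning concrete, here is a minimal sketch of one common way to choose which parameters of a large transformer stay trainable: freeze everything except the last few blocks and the output head. The selection rule, the function name `select_trainable`, and the parameter-name layout (modeled on CLIP-style checkpoints) are all assumptions for illustration; this summary does not state which layers the paper actually updates.

```python
import re

# Hypothetical parameter-name layout modeled on CLIP-style ViT checkpoints.
# The paper's actual selection rule is not given in this summary.
BLOCK_PATTERN = re.compile(r"visual\.transformer\.resblocks\.(\d+)\.")


def select_trainable(param_names, num_blocks, last_k,
                     always_train=("visual.ln_post", "visual.proj")):
    """Return the set of parameter names to leave trainable.

    Keeps the last `last_k` transformer blocks plus any parameter whose
    name starts with one of the `always_train` prefixes; the rest would
    be frozen by the caller.
    """
    first_trainable = max(num_blocks - last_k, 0)
    trainable = set()
    for name in param_names:
        match = BLOCK_PATTERN.match(name)
        if match:
            if int(match.group(1)) >= first_trainable:
                trainable.add(name)
        elif name.startswith(always_train):
            trainable.add(name)
    return trainable


# Toy name list standing in for a 24-block ViT-L/14 visual tower.
names = ["visual.conv1.weight", "visual.ln_post.weight", "visual.proj"]
names += [f"visual.transformer.resblocks.{i}.attn.in_proj_weight" for i in range(24)]

chosen = select_trainable(names, num_blocks=24, last_k=4)
print(sorted(chosen))
```

In a real training loop the returned set would typically be used to toggle gradient tracking per parameter. The value of `last_k` here is arbitrary and would need tuning.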

Findings

01 Achieved mAP of 46.69 on A2G, 41.23 on G2A, and 22.98 on A2A benchmarks.

02 Large-scale vision-language models with adaptation significantly improve robustness in extreme far-distance ReID.

03 Proposed methods outperform baseline models on the DetReIDX benchmark.
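For readers unfamiliar with the mAP metric quoted in the findings, the sketch below computes mean average precision under the standard retrieval definition: for each query, average the precision at every rank where a correct match appears, then average over queries. The function names and toy data are illustrative, and the benchmark's exact evaluation protocol (for example, how same-camera or distractor gallery items are filtered) is not described in this summary.

```python
def average_precision(ranked_relevance):
    """AP for one query.

    ranked_relevance[i] is True if the i-th ranked gallery item
    shares the query's identity.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_relevance):
    """mAP over queries; queries with no matching gallery item are skipped."""
    aps = [average_precision(r) for r in all_ranked_relevance if any(r)]
    return sum(aps) / len(aps) if aps else 0.0


# Toy rankings, not data from the paper.
queries = [
    [True, False, True, False],   # hits at ranks 1 and 3
    [False, True, False, False],  # single hit at rank 2
]
print(mean_average_precision(queries))
```

Skipping queries without any gallery match is one common convention; other evaluation scripts may handle that case differently.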

Abstract

Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate…
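The temporal attention pooling described above can be sketched as a softmax-weighted average over frame features, so that frames with low scores contribute little to the tracklet embedding. This is a generic formulation under stated assumptions, not the paper's implementation: the scoring vector `score_weights`, the `temperature` parameter, and the toy features are all hypothetical, and a real model would learn the scoring function.

```python
import math


def temporal_attention_pool(frame_features, score_weights, temperature=1.0):
    """Pool per-frame feature vectors into one tracklet vector.

    score_weights is a hypothetical learned vector; each frame's score
    is its dot product with that vector, divided by temperature.
    Returns (pooled_vector, attention_weights).
    """
    scores = [
        sum(w * x for w, x in zip(score_weights, feat)) / temperature
        for feat in frame_features
    ]
    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of frame features, one dimension at a time.
    dim = len(frame_features[0])
    pooled = [
        sum(a * feat[d] for a, feat in zip(attn, frame_features))
        for d in range(dim)
    ]
    return pooled, attn


# Toy 2-D features; the third frame stands in for a degraded observation.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = [2.0, -2.0]  # toy scoring vector, not learned
pooled, attn = temporal_attention_pool(frames, weights)
print(attn, pooled)
```

Compared with plain mean pooling, which weights every frame equally, this lets the model down-weight blurred or very low-resolution frames, at the cost of one small extra scoring step per frame.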
