Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification
Ashwat Rajbhandari, Bharatesh Chakravarthi

TL;DR
This paper improves extreme far-distance video person re-identification by adapting large-scale vision-language models with stability-focused fine-tuning, making them more robust to low resolution, motion blur, and aerial-ground viewpoint mismatch.
Contribution
It introduces a scale-aware adaptation framework for large vision-language models, including backbone upgrading, selective fine-tuning, temporal attention pooling, and improved re-ranking.
Findings
The method achieves mAP of 46.69 on the A2G, 41.23 on the G2A, and 22.98 on the A2A benchmarks.
Large-scale vision-language models with adaptation significantly improve robustness in extreme far-distance ReID.
Proposed methods outperform baseline models on the DetReIDX benchmark.
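For readers unfamiliar with the metric reported above, the following is a minimal sketch of how mean average precision (mAP) is commonly computed in retrieval-style ReID evaluation. It assumes each query yields a gallery ranking expressed as a list of booleans (True where the gallery item shares the query identity). The function names are illustrative, and the exact DetReIDX evaluation protocol (for example, any camera or tracklet filtering) is not described in this summary and is not reproduced here.

```python
from typing import List, Sequence


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """Average precision for one query, given its ranked gallery matches."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            # Precision at this rank, accumulated only at correct matches.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches: Sequence[Sequence[bool]]) -> float:
    """Mean of per-query AP, skipping queries with no correct gallery match."""
    aps: List[float] = [
        average_precision(r) for r in all_ranked_matches if any(r)
    ]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    rankings = [[True, False, True], [False, True]]
    print(mean_average_precision(rankings))
```

Scores such as 46.69 are conventionally this quantity multiplied by 100, although the summary does not state the unit explicitly.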
Abstract
Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial-ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate…
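The temporal attention pooling idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes per-frame feature vectors are already extracted, and it scores each frame by a scaled dot product with a hypothetical learned query vector (`query`), then takes a softmax-weighted average so that low-scoring (presumably degraded) frames contribute less to the tracklet embedding. The names `temporal_attention_pool`, `query`, and `temperature` are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(
    frames: Sequence[Sequence[float]],
    query: Sequence[float],
    temperature: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Pool per-frame features into one tracklet feature.

    frames: T feature vectors of dimension D (one per frame).
    query:  hypothetical learned vector of dimension D used to score frames.
    Returns (pooled_feature, attention_weights).
    """
    dim = len(query)
    scale = math.sqrt(dim) * temperature
    # One scalar score per frame: scaled dot product with the query.
    scores = [
        sum(f_d * q_d for f_d, q_d in zip(frame, query)) / scale
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted average of frame features along the temporal axis.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = temporal_attention_pool(feats, query=[10.0, 0.0])
    print(pooled, weights)
```

With an all-zero query the weights are uniform and the result reduces to plain average pooling, which makes the mechanism a strict generalization of the simpler baseline.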
