TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification
Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, and Tieniu Tan

TL;DR
TEST-V introduces a novel zero-shot video classification framework that dynamically enhances and tunes support-sets using multi-prompting and learnable erosion, achieving state-of-the-art results with interpretability.
Contribution
It proposes a new framework combining support-set dilation and erosion for improved zero-shot video classification, addressing semantic gaps and support-set tuning limitations.
Findings
Achieves state-of-the-art results on four benchmarks.
Enables dynamic support-set enhancement and tuning.
Provides interpretable support-set modifications.
Abstract
Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or by replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we draw on the strengths of both and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts obtained by querying LLMs to enrich the diversity of the support-set. ii)…
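The dilate-then-erode idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract is truncated before TSE is described, so everything below is an assumption made for illustration. In particular, the sketch assumes that support samples are already encoded as feature vectors (random stand-ins here), that "erosion" means a softmax weight per support sample, and that the weights are tuned at test time by minimizing prediction entropy, an objective borrowed from TPT rather than confirmed for TEST-V. The names `class_probs` and `tune_sample_logits` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

def class_probs(query, support, sample_logits, temperature=0.1):
    """Classify one query feature against a weighted support-set.

    query:         (D,)      feature of the test video (assumed given)
    support:       (C, K, D) K support features per class; in a "dilated"
                             set, K would cover several prompts per class
    sample_logits: (C, K)    learnable per-sample scores (assumed form of erosion)
    """
    weights = softmax(sample_logits, axis=1)                 # (C, K), rows sum to 1
    prototypes = (weights[..., None] * support).sum(axis=1)  # (C, D) weighted class means
    sims = l2_normalize(prototypes) @ l2_normalize(query)    # cosine similarity per class
    return softmax(sims / temperature)

def tune_sample_logits(query, support, steps=10, lr=1.0, h=1e-4):
    """Lower prediction entropy by adjusting per-sample logits.

    Uses central finite differences instead of autograd to stay dependency-free;
    a step is accepted only if entropy decreases, otherwise lr is halved.
    """
    logits = np.zeros(support.shape[:2])
    best = entropy(class_probs(query, support, logits))
    for _ in range(steps):
        grad = np.zeros_like(logits)
        for idx in np.ndindex(*logits.shape):
            up, down = logits.copy(), logits.copy()
            up[idx] += h
            down[idx] -= h
            grad[idx] = (entropy(class_probs(query, support, up))
                         - entropy(class_probs(query, support, down))) / (2 * h)
        candidate = logits - lr * grad
        cand_entropy = entropy(class_probs(query, support, candidate))
        if cand_entropy < best:
            logits, best = candidate, cand_entropy
        else:
            lr *= 0.5
    return logits
```

After tuning, `softmax(sample_logits, axis=1)` can be inspected to see which support samples were kept and which were down-weighted, which is one plausible reading of the interpretability claim in the Findings above.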
Taxonomy
Topics: COVID-19 diagnosis using AI · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning
