Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
Kaiting Liu, Hazel Doughty

TL;DR
This paper introduces a zero-shot classifier editing technique for fine-grained video understanding that refines a coarse category into finer subcategories without additional data, while preserving accuracy on the remaining categories.
Contribution
The paper proposes a novel zero-shot editing method for classifiers that leverages latent structure to refine categories, along with a new benchmark for category splitting in videos.
Findings
Outperforms vision-language baselines in category splitting tasks
Significantly improves accuracy on newly split categories
Zero-shot initialization benefits low-shot fine-tuning
Abstract
Video recognition models are typically trained on fixed taxonomies, which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions, and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category…
Peer Reviews
Decision: ICLR 2026 Poster
The method shows good improvement over baselines. The authors also provide many ablations, such as the choice of encoder and pretraining, among others. The generated dataset has also been fully provided, aiding transparency. The concept of zero-shot adaptation to finer granularity levels is interesting and warrants more attention.
The paper uses the term "video model" very generally to refer specifically to a kind of video-language model. This is misleading, as "video model" can refer to other concepts such as video generation models. The method is very hard to follow, as the authors do not provide any preliminary information about the architecture it is based upon. E.g., it is hard to follow which weight vectors are being referred to as additive. Subjective: The title of the paper does not convey the prob
- Proposes and explores a novel problem of clear practical relevance
- The work explores different forms of class splitting, both where the modifiers are seen or unseen
- The proposed solution is tidy and lightweight: they edit only the classifier head, reuse structure already present in the model by building a modifier dictionary from existing labels
- The authors construct 2 datasets for this problem from SSv2 and FineGym, which pose challenging test cases
- Method depends significantly on the assumption that the original label space already contains enough compositional variation to learn good modifier vectors
- Because the edit happens at the classifier head, it also assumes the backbone already captures the visual distinctions the new sublabels require; if the new split introduces visual novelty rather than just semantic refinement, a head-only edit will struggle, and the paper doesn’t really explore that failure mode
- There is some amount o
-- Practical & Elegant Solution: Addresses real problem of fine-grained classification without expensive retraining - just intelligent matrix manipulation
-- Compositional Approach: The weight arithmetic (w_subcategory = w_coarse + v_modifier) is intuitive and enables systematic fine-grained category generation
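The weight arithmetic quoted in this review (w_subcategory = w_coarse + v_modifier) can be illustrated with a toy sketch. It is not the authors' implementation: the label names, the 2-D weights, the helper names `estimate_modifier` and `split_category`, and the choice to estimate a modifier vector as the mean weight difference between existing label pairs are all assumptions made for illustration. The reviews only state that a modifier dictionary is built from existing labels and that only the classifier head is edited.

```python
# Toy sketch of a head-only "category split". Everything here is hypothetical:
# labels, weights, helper names, and the mean-difference modifier estimate.

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def estimate_modifier(head, pairs):
    """Assumed rule: average the weight difference between existing labels
    that carry the modifier and their unmodified counterparts."""
    return mean([sub(head[modified], head[base]) for modified, base in pairs])

def split_category(head, coarse, modifiers):
    """Replace the coarse row with one row per subcategory:
    w_subcategory = w_coarse + v_modifier. Other rows are left untouched."""
    edited = {label: w for label, w in head.items() if label != coarse}
    for name, v in modifiers.items():
        edited[f"{coarse} ({name})"] = add(head[coarse], v)
    return edited

# Classifier head as label -> weight vector (made-up 2-D weights).
head = {
    "open door": [1.0, 0.0],
    "open door slowly": [1.5, 1.0],
    "open door quickly": [1.5, -1.0],
    "close door": [0.0, 1.0],
    "close door slowly": [0.5, 2.0],
    "pour water": [2.0, 2.0],  # coarse category to split
}

modifiers = {
    "slowly": estimate_modifier(head, [("open door slowly", "open door"),
                                       ("close door slowly", "close door")]),
    "quickly": estimate_modifier(head, [("open door quickly", "open door")]),
}

edited = split_category(head, "pour water", modifiers)
print(sorted(edited))
```

Under these assumptions the edited head gains one row per subcategory and loses the coarse row, which matches the reviewers' description of a lightweight edit that grows the classifier matrix without touching the backbone.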
-- Scalability Questions: Method requires existing fine-grained examples to extract modifiers, and matrix growth (100→150 categories) may not scale to truly large taxonomies
-- Clarity and Organization Issues: Paper was slightly difficult to follow on the main contribution and method - would benefit from restructuring for better readability (more intuitive figures perhaps?)
-- Insufficient Baseline Comparisons: Only compares against basic vision-language models (CLIP, VideoCLIP) rather than es
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Face recognition and analysis
