Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding

Kaiting Liu; Hazel Doughty

arXiv:2602.16545·cs.CV·February 19, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces a zero-shot classifier editing technique for fine-grained video understanding, enabling the refinement of coarse categories into subcategories without additional data, improving accuracy and adaptability.

Contribution

The paper proposes a novel zero-shot editing method for classifiers that leverages latent structure to refine categories, along with a new benchmark for category splitting in videos.

Findings

01. Outperforms vision-language baselines on category-splitting tasks

02. Significantly improves accuracy on newly split categories

03. Zero-shot initialization benefits low-shot fine-tuning

Abstract

Video recognition models are typically trained on fixed taxonomies, which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions, and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

The method shows good improvement over baselines. The authors also provide many ablations, such as the choice of encoder and pretraining, among others. The generated dataset has also been provided in full, aiding transparency. The concept of zero-shot adaptation to finer granularity levels is interesting and warrants more attention.

Weaknesses

The paper uses the term "video model" very generally to refer specifically to a kind of video-language model. This is misleading, as "video model" can also refer to other concepts such as video generation models. The method is very hard to follow because the authors do not provide any preliminary information on the architecture it is based upon. E.g., it is hard to tell which weight vectors are being referred to as additive. Subjective: The title of the paper does not convey the prob

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Proposes and explores a novel problem of clear practical relevance
- The work explores different forms of class splitting, both where the modifiers are seen or unseen
- The proposed solution is tidy and lightweight: they edit only the classifier head, reuse structure already present in the model by building a modifier dictionary from existing labels
- The authors construct 2 datasets for this problem from SSv2 and FineGym, which pose challenging test cases
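
The "modifier dictionary from existing labels" mentioned above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes, purely for illustration, that some existing labels decompose into a base label plus a modifier, and that a modifier vector can be estimated as the average difference between the classifier-head weight of the modified label and that of its base. The function name `build_modifier_dictionary`, the label strings, and the two-dimensional weights are all hypothetical.

```python
from collections import defaultdict


def build_modifier_dictionary(head_weights, label_parts):
    """Estimate one vector per modifier from an existing classifier head.

    head_weights: dict mapping label -> weight vector (list of floats).
    label_parts:  dict mapping a compositional label -> (base_label, modifier).
    Assumption (not from the paper): the modifier vector is the mean of
    (w_label - w_base) over all labels that carry that modifier.
    """
    diffs = defaultdict(list)
    for label, (base, modifier) in label_parts.items():
        if label not in head_weights or base not in head_weights:
            continue  # skip labels we cannot decompose
        w_label, w_base = head_weights[label], head_weights[base]
        diffs[modifier].append([a - b for a, b in zip(w_label, w_base)])
    # Average the difference vectors column by column.
    return {
        modifier: [sum(col) / len(vectors) for col in zip(*vectors)]
        for modifier, vectors in diffs.items()
    }


# Hypothetical toy head with two base actions and their "slowly" variants.
head_weights = {
    "push": [1.0, 0.0],
    "push slowly": [1.0, 0.5],
    "pull": [0.0, 1.0],
    "pull slowly": [0.0, 1.25],
}
label_parts = {
    "push slowly": ("push", "slowly"),
    "pull slowly": ("pull", "slowly"),
}
modifiers = build_modifier_dictionary(head_weights, label_parts)
print(modifiers)
```

In this toy setting the two "slowly" differences are averaged into a single reusable vector. Whether the paper estimates modifiers this way is not stated in the excerpt.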

Weaknesses

- Method depends significantly on the assumption that the original label space already contains enough compositional variation to learn good modifier vectors
- Because the edit happens at the classifier head, it also assumes the backbone already captures the visual distinctions the new sublabels require; if the new split introduces visual novelty rather than just semantic refinement, a head-only edit will struggle, and the paper doesn’t really explore that failure mode
- There is some amount o

Reviewer 03 · Rating 6 · Confidence 4

Strengths

-- Practical & Elegant Solution: Addresses real problem of fine-grained classification without expensive retraining - just intelligent matrix manipulation
-- Compositional Approach: The weight arithmetic (w_subcategory = w_coarse + v_modifier) is intuitive and enables systematic fine-grained category generation
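
To make the weight arithmetic quoted above concrete, here is a minimal sketch of a head-only edit under stated assumptions. It assumes a linear classifier head stored as a list of rows, and it assumes the coarse row is replaced by one row per subcategory, each computed as w_coarse + v_modifier. The helper names `split_category` and `predict`, the labels, and the numbers are hypothetical and are not taken from the paper.

```python
def split_category(weights, labels, coarse, modifiers):
    """Replace the coarse row with one row per subcategory.

    Each new row is w_coarse + v_modifier; all other rows are left untouched.
    Replacing (rather than keeping) the coarse row is an assumption here.
    """
    idx = labels.index(coarse)
    w_coarse = weights[idx]
    new_rows = [[c + m for c, m in zip(w_coarse, v)] for v in modifiers.values()]
    new_labels = [f"{coarse} ({name})" for name in modifiers]
    return (
        weights[:idx] + new_rows + weights[idx + 1:],
        labels[:idx] + new_labels + labels[idx + 1:],
    )


def predict(weights, labels, feature):
    """Return the label whose weight row has the largest dot product."""
    scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    return labels[scores.index(max(scores))]


# Hypothetical 3-dimensional head with two coarse classes.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
labels = ["open", "close"]
modifiers = {"slowly": [0.0, 0.0, 1.0], "quickly": [0.0, 0.0, -1.0]}

new_weights, new_labels = split_category(weights, labels, "open", modifiers)
print(new_labels)
print(predict(new_weights, new_labels, [1.0, 0.0, 0.5]))
print(predict(new_weights, new_labels, [0.0, 1.0, 0.3]))
```

The untouched "close" row keeps its original weights, which mirrors the stated goal of preserving accuracy elsewhere. The head grows by one row per extra subcategory, which is the matrix growth the reviewer raises as a scalability concern.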

Weaknesses

-- Scalability Questions: Method requires existing fine-grained examples to extract modifiers, and matrix growth (100→150 categories) may not scale to truly large taxonomies
-- Clarity and Organization Issues: Paper was slightly difficult to follow on the main contribution and method - would benefit from restructuring for better readability (more intuitive figures perhaps?)
-- Insufficient Baseline Comparisons: Only compares against basic vision-language models (CLIP, VideoCLIP) rather than es


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Face recognition and analysis