Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang; Sangwoo Mo; Stella X. Yu; Sima Behpour; Liu Ren

arXiv:2512.16202 · cs.CV · December 19, 2025

Open Access

TL;DR

This paper introduces OAK, a model that enhances open ad-hoc categorization by combining CLIP's perceptual capabilities with visual clustering, enabling accurate, interpretable, and adaptable categorization of visual scenes with minimal labeled data.

Contribution

OAK adds a small set of learnable context tokens at the input of a frozen CLIP and trains them with both CLIP's image-text alignment objective and GCD's visual clustering objective, improving ad-hoc categorization and concept discovery.

Findings

01. Achieves 87.4% accuracy on the Stanford Mood dataset

02. Outperforms CLIP and GCD by over 50% in novel accuracy

03. Produces interpretable saliency maps highlighting relevant visual features

Abstract

Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations,…
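
The design described in the abstract (learnable context tokens in front of a frozen encoder, trained with an alignment term plus a clustering term) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `frozen_encoder` is a mean-pooling stand-in rather than CLIP, `clustering_loss` is a simple two-view consistency term rather than GCD's actual objective, and all names, temperatures, and the `weight` parameter are hypothetical. It only shows how the two objectives could share the same context tokens.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frozen_encoder(tokens):
    # Toy stand-in for a frozen encoder (NOT CLIP): mean-pool, then L2-normalize.
    # It has no trainable parameters of its own.
    dim = len(tokens[0])
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return normalize(pooled)

def encode_with_context(context_tokens, input_tokens):
    # The context tokens are the only learnable part; they are prepended
    # to the input sequence before it reaches the frozen encoder.
    return frozen_encoder(context_tokens + input_tokens)

def alignment_loss(image_feat, text_feats, label, temperature=0.07):
    # Cross-entropy over image-text cosine similarities (CLIP-style, assumed form).
    logits = [dot(image_feat, t) / temperature for t in text_feats]
    return -math.log(softmax(logits)[label])

def clustering_loss(feat_a, feat_b, prototypes, temperature=0.1):
    # Assumed stand-in for a clustering objective: two views of one unlabeled
    # image should receive similar soft cluster assignments.
    p = softmax([dot(feat_a, c) / temperature for c in prototypes])
    q = softmax([dot(feat_b, c) / temperature for c in prototypes])
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def combined_loss(context, labeled, unlabeled, text_feats, prototypes, weight=1.0):
    # Labeled exemplars drive alignment; unlabeled view pairs drive clustering.
    align = sum(
        alignment_loss(encode_with_context(context, x), text_feats, y)
        for x, y in labeled
    ) / len(labeled)
    cluster = sum(
        clustering_loss(
            encode_with_context(context, a),
            encode_with_context(context, b),
            prototypes,
        )
        for a, b in unlabeled
    ) / len(unlabeled)
    return align + weight * cluster

# Hypothetical 4-dimensional toy data.
context = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
labeled = [([[1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]], 0)]
unlabeled = [(
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.9, 0.1]],
    [[0.0, 0.0, 0.9, 0.0], [0.0, 0.1, 1.0, 0.0]],
)]
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
prototypes = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

loss = combined_loss(context, labeled, unlabeled, text_feats, prototypes)
print("combined loss:", loss)
```

In a real setup the context tokens would be updated by gradient descent on this combined loss while the encoder weights stay fixed; the sketch stops at computing the loss.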

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection