Open Ad-hoc Categorization with Contextualized Feature Learning
Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren

TL;DR
This paper introduces OAK, a model that enhances open ad-hoc categorization by combining CLIP's perceptual capabilities with visual clustering, enabling accurate, interpretable, and adaptable categorization of visual scenes with minimal labeled data.
Contribution
OAK leverages learnable context tokens with CLIP, integrating alignment and clustering objectives to improve ad-hoc categorization and concept discovery.
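To make the contribution concrete, here is a minimal forward-pass sketch of the idea of training only a few context tokens against a combined alignment and clustering loss. It is not the authors' implementation. Everything in it is an assumption made for illustration: the frozen encoder is a fixed random projection with mean pooling rather than CLIP, the clustering term is a simple two-view contrastive loss standing in for GCD's objective, and all names (`encode`, `alignment_loss`, `clustering_loss`, `total_loss`, `N_CTX`, `weight`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16      # embedding width (illustrative, not CLIP's)
N_CTX = 4   # number of learnable context tokens (illustrative)

# Assumption: a fixed random projection stands in for the frozen CLIP encoder.
W_frozen = rng.normal(size=(D, D)) / np.sqrt(D)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode(patch_tokens, context_tokens):
    """Prepend the shared context tokens to each image, mean-pool, project."""
    batch = patch_tokens.shape[0]
    ctx = np.broadcast_to(context_tokens, (batch,) + context_tokens.shape)
    tokens = np.concatenate([ctx, patch_tokens], axis=1)
    return l2_normalize(tokens.mean(axis=1) @ W_frozen)

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def alignment_loss(img_emb, text_emb, labels, temperature=0.07):
    """Cross-entropy between image embeddings and class-name text embeddings."""
    logp = log_softmax(img_emb @ text_emb.T / temperature)
    return -logp[np.arange(len(labels)), labels].mean()

def clustering_loss(view_a, view_b, temperature=0.07):
    """Two-view contrastive loss: row i of view_a should match row i of view_b."""
    logp = log_softmax(view_a @ view_b.T / temperature)
    return -np.diag(logp).mean()

def total_loss(context_tokens, labeled, labels, text_emb, unl_a, unl_b, weight=1.0):
    """Alignment on labeled exemplars plus clustering on unlabeled images."""
    align = alignment_loss(encode(labeled, context_tokens), text_emb, labels)
    cluster = clustering_loss(encode(unl_a, context_tokens),
                              encode(unl_b, context_tokens))
    return align + weight * cluster

# Toy data: 6 labeled images over 3 classes, 8 unlabeled images, 9 patches each.
context = rng.normal(size=(N_CTX, D)) * 0.02   # the only trainable parameters
labeled = rng.normal(size=(6, 9, D))
labels = np.array([0, 1, 2, 0, 1, 2])
text_emb = l2_normalize(rng.normal(size=(3, D)))
unlabeled = rng.normal(size=(8, 9, D))
unl_a = unlabeled + 0.05 * rng.normal(size=unlabeled.shape)  # augmented view 1
unl_b = unlabeled + 0.05 * rng.normal(size=unlabeled.shape)  # augmented view 2

loss = total_loss(context, labeled, labels, text_emb, unl_a, unl_b)
print(float(loss))
```

In a real setup the context tokens would interact with image tokens through the transformer's attention layers, and `context` would be updated by gradient descent while the encoder weights stay fixed; the sketch only shows how the two loss terms share the same small set of parameters.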
Findings
Achieves 87.4% accuracy on the Stanford Mood dataset
Outperforms CLIP and GCD by over 50% in accuracy on novel categories
Produces interpretable saliency maps highlighting relevant visual features
Abstract
Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
