OvarNet: Towards Open-vocabulary Object Attribute Recognition
Keyan Chen, Xiaolong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen, Weidi Xie

TL;DR
This paper introduces OvarNet, a comprehensive approach for open-vocabulary object detection and attribute recognition, leveraging federated training, weak supervision, and knowledge distillation to improve generalization and efficiency.
Contribution
It proposes a multi-stage framework combining dataset fusion, weakly supervised learning, and knowledge distillation for open-vocabulary object and attribute detection.
Findings
Joint training improves scene understanding accuracy.
Model generalizes well to unseen attributes and categories.
End-to-end training outperforms naive two-stage methods.
Abstract
In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr. The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type…
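The second stage of the two-stage approach described in (i), classifying each proposed region for category and attributes, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hand-written 3-dimensional vectors stand in for CLIP image and text embeddings, the function names (`cosine`, `classify_category`, `predict_attributes`) are hypothetical, and the sigmoid-with-threshold scoring is one common way to do multi-label prediction, not necessarily the rule the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_category(region_emb, category_embs):
    """Single-label step: pick the category whose text embedding is closest."""
    return max(category_embs, key=lambda name: cosine(region_emb, category_embs[name]))

def predict_attributes(region_emb, attribute_embs, temperature=0.1, threshold=0.5):
    """Multi-label step: keep every attribute whose sigmoid score passes the threshold.

    The temperature and threshold values are illustrative assumptions.
    """
    kept = []
    for name, emb in attribute_embs.items():
        score = 1.0 / (1.0 + math.exp(-cosine(region_emb, emb) / temperature))
        if score > threshold:
            kept.append(name)
    return kept

# Toy stand-ins for embeddings; a real system would obtain these from
# an image encoder (for the region crop) and a text encoder (for the prompts).
region = [1.0, 0.2, 0.0]
categories = {"dog": [0.9, 0.1, 0.0], "car": [0.0, 0.0, 1.0]}
attributes = {
    "furry": [1.0, 0.3, 0.1],
    "metallic": [-0.2, 0.0, 1.0],
    "brown": [0.8, 0.4, 0.0],
}

category = classify_category(region, categories)
region_attributes = predict_attributes(region, attributes)
print(category, region_attributes)
```

Because labels are matched through text embeddings rather than a fixed classifier head, new categories or attributes can be added to the dictionaries at inference time without retraining, which is what makes this style of classification open-vocabulary.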
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training · Region Proposal Network
