TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection
Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Yezi Liu, Fei Wen, Alvaro Velasquez, Hugo Latapie, Mohsen Imani

TL;DR
TaskCLIP introduces a two-stage, vision-language model-based approach for task-oriented object detection, leveraging large VLMs and a transformer aligner to improve accuracy and generalizability over existing methods.
Contribution
The paper presents a novel two-stage framework using large vision-language models and a transformer aligner for improved task-oriented object detection.
Findings
Outperforms the state-of-the-art DETR-based model TOIST by 3.5%
Requires only a single NVIDIA RTX 4090 for training and inference
Uses a transformer aligner to align image and text embeddings for better object selection
Abstract
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts.…
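The two-stage split described in the abstract (general object detection, then task-guided object selection in a shared image-text embedding space) can be illustrated with a minimal sketch. This is not the paper's implementation: the transformer aligner is replaced here by a plain cosine-similarity threshold, and every name, box, embedding, and the threshold value below is a hypothetical stand-in chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_objects(candidates, task_embedding, threshold=0.5):
    """Stage 2 (simplified): keep boxes whose embedding is close to the task embedding.

    ASSUMPTION: a cosine threshold stands in for the paper's transformer aligner.
    """
    scored = [(cosine(c["embedding"], task_embedding), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"box": c["box"], "label": c["label"], "score": s}
        for s, c in scored
        if s >= threshold
    ]


# Stage 1 output (hypothetical): boxes from any general-purpose detector, each
# paired with a toy 3-d vector standing in for a VLM image embedding.
candidates = [
    {"box": (10, 20, 60, 90), "label": "chair", "embedding": [0.9, 0.1, 0.0]},
    {"box": (70, 15, 120, 80), "label": "lamp", "embedding": [0.0, 0.2, 0.9]},
    {"box": (5, 5, 40, 40), "label": "sofa", "embedding": [0.8, 0.3, 0.1]},
]

# Toy stand-in for the VLM text embedding of a task prompt such as "sit on".
task_embedding = [1.0, 0.2, 0.0]

selected = select_objects(candidates, task_embedding)
print([s["label"] for s in selected])
```

In this toy setup the chair and sofa embeddings point in nearly the same direction as the task embedding and are kept, while the lamp falls below the threshold. A real pipeline would obtain the embeddings from a VLM's image and text encoders rather than hand-written vectors.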
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Automated Systems · Advanced Neural Network Applications
