TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection

Hanning Chen; Wenjun Huang; Yang Ni; Sanggeon Yun; Yezi Liu; Fei Wen; Alvaro Velasquez; Hugo Latapie; Mohsen Imani

arXiv:2403.08108 · cs.CV · September 9, 2024 · 5 cites

Open Access

TL;DR

TaskCLIP introduces a two-stage, vision-language model-based approach for task-oriented object detection, leveraging large VLMs and a transformer aligner to improve accuracy and generalizability over existing methods.

Contribution

The paper presents a novel two-stage framework using large vision-language models and a transformer aligner for improved task-oriented object detection.

Findings

01. Outperforms state-of-the-art DETR-based model TOIST by 3.5%

02. Requires only a single NVIDIA RTX 4090 for training and inference

03. Effectively aligns embeddings for better object selection

Abstract

Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts.…
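
To make the second stage more concrete, here is a minimal sketch of task-guided object selection in a shared embedding space. It is an illustration only, not the paper's method: it replaces the transformer aligner with a plain cosine-similarity threshold, and the function names, the toy three-dimensional embeddings, and the threshold value are all assumptions made for this example. In practice the vectors would come from a VLM's text encoder (for the task description) and image encoder (for each detected object crop).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_objects(
    task_embedding: Sequence[float],
    object_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of detected objects whose embedding is close to the task.

    Hypothetical stand-in for stage two: stage one (a general object
    detector) is assumed to have produced one embedding per detected box.
    """
    return [
        i
        for i, emb in enumerate(object_embeddings)
        if cosine(task_embedding, emb) >= threshold
    ]


# Toy embeddings (assumed values, not real VLM outputs).
task = [1.0, 0.0, 0.0]
objects = [
    [0.9, 0.1, 0.0],  # closely aligned with the task
    [0.0, 1.0, 0.0],  # orthogonal to the task
    [0.7, 0.7, 0.0],  # partially aligned
]
selected = select_objects(task, objects, threshold=0.8)
print(selected)
```

Lowering the threshold admits the partially aligned object as well, which shows why a fixed cutoff is a crude substitute for a learned alignment step.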

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Automated Systems · Advanced Neural Network Applications