Contextual Object Detection with Multimodal Large Language Models

Yuhang Zang; Wei Li; Jun Han; Kaiyang Zhou; Chen Change Loy

arXiv:2305.18279 · cs.CV · August 13, 2024 · 6 cites

TL;DR

This paper introduces ContextDET, a multimodal model that gives large language models the ability to perform contextual object detection in human-AI interactive scenarios, addressing the lack of object detection in existing multimodal LLMs.

Contribution

The work presents a novel generate-then-detect framework and a unified model for end-to-end visual-language contextual object detection, enabling detection of objects named by words from the human vocabulary.
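
As a rough illustration of what "generate-then-detect" can mean, the sketch below first proposes object words from a language context and then looks up boxes for each proposed word. It is a toy under stated assumptions, not ContextDET's implementation: the names `generate_then_detect`, `toy_generate_words`, `toy_locate`, and `TOY_BOXES`, the `[MASK]` placeholder, and the box format are all hypothetical, and the two hand-written stand-ins replace what would be learned components in a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Detection:
    word: str
    box: Box


def generate_then_detect(
    prompt: str,
    generate_words: Callable[[str], List[str]],
    locate: Callable[[str], List[Box]],
) -> List[Detection]:
    """Toy two-stage pipeline: propose object words, then ground each one."""
    # Stage 1 (generate): propose object words from the language context.
    words = generate_words(prompt)
    # Stage 2 (detect): attach zero or more boxes to every proposed word.
    return [Detection(word, box) for word in words for box in locate(word)]


# Hand-written stand-ins for learned components (assumptions, not real models).
TOY_BOXES: Dict[str, List[Box]] = {
    "dog": [(10.0, 20.0, 110.0, 140.0)],
    "frisbee": [(120.0, 30.0, 160.0, 60.0)],
}


def toy_generate_words(prompt: str) -> List[str]:
    # Pretend a language model fills each [MASK], in order, with a fixed answer.
    answers = ["dog", "frisbee"]
    return answers[: prompt.count("[MASK]")]


def toy_locate(word: str) -> List[Box]:
    # Words with no known region yield no boxes.
    return TOY_BOXES.get(word, [])


detections = generate_then_detect(
    "A [MASK] is catching a [MASK].", toy_generate_words, toy_locate
)
for det in detections:
    print(det.word, det.box)
```

The point of the ordering is that the set of detectable names comes from the generation step rather than from a fixed class list, which is one way to read the "human vocabulary" claim above.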

Findings

01 Outperforms existing models on the CODE benchmark

02 Provides effective open-vocabulary detection

03 Improves referring image segmentation accuracy

Abstract

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM…

Code & Models

Repositories

yuhangzang/contextdet
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling