Contextual Object Detection with Multimodal Large Language Models
Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy

TL;DR
This paper introduces ContextDET, a multimodal model that enhances large language models with the ability to perform contextual object detection within human-AI interactive scenarios, addressing a key perception gap.
Contribution
The work presents a novel generate-then-detect framework and a unified model for end-to-end visual-language contextual object detection, enabling detection of human vocabulary objects.
Findings
Outperforms existing models on CODE benchmark
Effective open-vocabulary detection capabilities
Improves referring image segmentation accuracy
Abstract
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
