LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin

TL;DR
LMM-Det demonstrates that large multimodal models can be effectively adapted for object detection tasks without specialized detection modules, by analyzing and optimizing their capabilities through data and inference adjustments.
Contribution
Proposes a simple approach to enable large multimodal models to perform object detection without additional detection-specific components.
Findings
LMMs show significantly lower recall in object detection than specialist detectors.
Data distribution adjustment and inference optimization improve detection recall.
Extensive experiments validate the effectiveness of LMM-Det.
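To make the recall finding concrete, here is a minimal sketch of how detection recall at a fixed IoU threshold could be computed for one image. It is not the paper's evaluation protocol: the function names `iou` and `recall_at_iou`, the class-agnostic greedy matching, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates; this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(gt: List[Box], pred: List[Box], thr: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by some prediction with IoU >= thr.

    Hypothetical simplification: class labels are ignored and each prediction
    can be matched to at most one ground-truth box (greedy matching).
    """
    if not gt:
        return 0.0
    used = set()  # indices of predictions already matched
    hits = 0
    for g in gt:
        best_j, best_v = -1, thr
        for j, p in enumerate(pred):
            if j in used:
                continue
            v = iou(g, p)
            if v >= best_v:
                best_j, best_v = j, v
        if best_j >= 0:
            used.add(best_j)
            hits += 1
    return hits / len(gt)


# Two ground-truth objects, but the model proposes only one box.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10)]
example_recall = recall_at_iou(gt_boxes, pred_boxes)
```

In this toy case the single prediction covers only one of the two objects, so recall is capped by how many boxes are proposed, however accurate each box is.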
Abstract
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object…
Taxonomy
Topics: Machine Learning and Data Classification
