# MQADet: a plug-and-play paradigm for enhancing open-vocabulary object detection via multimodal question answering

**Authors:** Caixiong Li, Xiongwei Zhao, Jinhang Zhang, Xing Zhang, Qihao Sun, Zhou Wu

PMC · DOI: 10.1038/s41598-026-36936-x · Scientific Reports · 2026-01-27

## TL;DR

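MQADet is a training-free, plug-and-play method that uses the cross-modal reasoning of multimodal large language models to help existing open-vocabulary detectors localize objects described by complex text queries, including categories unseen during training.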

## Contribution

MQADet introduces a plug-and-play framework that uses multimodal large language models (MLLMs) to enhance pre-trained open-vocabulary object detectors without additional training or fine-tuning.

## Key findings

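- MQADet consistently improves detection accuracy, particularly for unseen and linguistically complex categories, on a benchmark built from four open-vocabulary datasets.
- The three-stage MQA pipeline guides MLLMs to localize objects described by complex textual queries and steers existing detectors toward semantically relevant regions.
- The benchmark integrates three state-of-the-art detectors as baselines, and experiments show consistent gains over them in challenging scenarios.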

## Abstract

Open-vocabulary detection (OVD) aims to detect and classify objects from an unrestricted set of categories, including those unseen during training. Existing open-vocabulary detectors often suffer from visual-textual misalignment and long-tailed category imbalance, leading to poor performance when handling objects described by complex, long-tailed textual queries. To overcome these challenges, we propose Multimodal Question Answering Detection (MQADet), a universal plug-and-play paradigm that enhances existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). MQADet can be seamlessly integrated with pre-trained object detectors without requiring additional training or fine-tuning. Specifically, we design a novel three-stage Multimodal Question Answering (MQA) pipeline that guides MLLMs to accurately localize objects described by complex textual queries while refining the focus of existing detectors toward semantically relevant regions. To evaluate our approach, we construct a comprehensive benchmark across four challenging open-vocabulary datasets and integrate three state-of-the-art detectors as baselines. Extensive experiments demonstrate that MQADet consistently improves detection accuracy, particularly for unseen and linguistically complex categories, across diverse and challenging scenarios. To support further research, we will publicly release our code.
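The plug-and-play idea in the abstract can be illustrated with a minimal sketch. This summary does not spell out the three stages, so the split used here (subject extraction from the query, candidate positioning by an existing detector, final selection by an MLLM) is an assumption for illustration, not the authors' implementation. `Box`, `mqa_detect`, and the three callables are hypothetical names; any pre-trained detector or MLLM client could be wrapped to match them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with a category label and detector confidence."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    score: float


# Hypothetical interfaces (assumptions, not APIs from the paper).
SubjectExtractor = Callable[[str], List[str]]            # query -> subject phrases
Detector = Callable[[object, Sequence[str]], List[Box]]  # (image, prompts) -> boxes
BoxSelector = Callable[[object, str, List[Box]], int]    # (image, query, boxes) -> index


def mqa_detect(
    image: object,
    query: str,
    extract_subjects: SubjectExtractor,
    detector: Detector,
    select_box: BoxSelector,
    min_score: float = 0.0,
) -> Optional[Box]:
    """Run an assumed three-stage pipeline around a frozen detector."""
    # Stage 1 (assumed): reduce the complex query to short subject phrases.
    subjects = extract_subjects(query)
    if not subjects:
        return None

    # Stage 2 (assumed): let the existing detector propose candidate boxes.
    candidates = [b for b in detector(image, subjects) if b.score >= min_score]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    # Stage 3 (assumed): ask the MLLM which candidate matches the full query.
    index = select_box(image, query, candidates)
    if not 0 <= index < len(candidates):
        # Invalid answer from the selector: fall back to the top-scoring box.
        return max(candidates, key=lambda b: b.score)
    return candidates[index]


# Stub components standing in for a real MLLM and detector.
def stub_extract(query: str) -> List[str]:
    return ["dog"] if "dog" in query else []


def stub_detector(image: object, prompts: Sequence[str]) -> List[Box]:
    return [Box(0, 0, 10, 10, "dog", 0.9), Box(20, 0, 30, 10, "dog", 0.6)]


def stub_select(image: object, query: str, boxes: List[Box]) -> int:
    # Pretend the MLLM resolves "on the right" to the rightmost candidate.
    return max(range(len(boxes)), key=lambda i: boxes[i].x1)


result = mqa_detect(None, "the dog on the right", stub_extract, stub_detector, stub_select)
print(result)
```

Because the detector and the MLLM are passed in as plain callables, neither needs retraining: swapping in a different detector only means supplying a different `detector` function.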

## Full-text entities

- **Models and terms:** OmDet-Turbo, DINO, MLLMs, TMOP, OV
- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615], Musa acuminata (banana, species) [taxon 4641], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12905182/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12905182/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/PMC12905182/full.md

---
Source: https://tomesphere.com/paper/PMC12905182