Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning

Jingru Yang; Huan Yu; Yang Jingxin; Chentianye Xu; Yin Biao; Yu Sun; Shengfeng He

arXiv:2411.10252·cs.CV·November 18, 2024

Open Access

TL;DR

The paper introduces the Visual-Linguistic Agent (VLA), a collaborative framework that combines multimodal large language models (MLLMs) with traditional object detectors to improve object localization and contextual reasoning in images.

Contribution

It proposes a novel collaborative approach that integrates the relational reasoning of MLLMs with the precise localization of object detectors, enhancing multimodal understanding.

Findings

01

Significant performance improvements on the COCO dataset

02

Enhanced spatial reasoning and object localization

03

Set new benchmarks in contextually coherent detection

Abstract

Multimodal Large Language Models (MLLMs) excel at descriptive tasks within images but often struggle with precise object localization, a critical element for reliable visual interpretation. In contrast, traditional object detection models provide high localization accuracy but frequently generate detections lacking contextual coherence due to limited modeling of inter-object relationships. To address this fundamental limitation, we introduce the Visual-Linguistic Agent (VLA), a collaborative framework that combines the relational reasoning strengths of MLLMs with the precise localization capabilities of traditional object detectors. In the VLA paradigm, the MLLM serves as a central Linguistic Agent, working collaboratively with specialized Vision Agents for object detection and classification. The Linguistic Agent evaluates and refines detections by reasoning over spatial and…
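
The collaboration pattern described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Detection` type, the `refine` loop, and the toy `toy_is_plausible` and `toy_reclassify` stand-ins (including the co-occurrence rule and the label mapping) are all hypothetical, chosen only to show how a Linguistic Agent might review detector output against its context and hand flagged items to a Vision Agent for reclassification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """One detector output (hypothetical structure for illustration)."""
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    score: float


def linguistic_agent_review(
    detections: List[Detection],
    is_plausible: Callable[[Detection, List[Detection]], bool],
) -> Tuple[List[Detection], List[Detection]]:
    """Split detections into (kept, flagged) by checking each one
    against the other detections in the same image."""
    kept, flagged = [], []
    for det in detections:
        context = [d for d in detections if d is not det]
        (kept if is_plausible(det, context) else flagged).append(det)
    return kept, flagged


def refine(
    detections: List[Detection],
    is_plausible: Callable[[Detection, List[Detection]], bool],
    reclassify: Callable[[Detection], Optional[str]],
) -> List[Detection]:
    """Keep plausible detections; ask a classification agent to relabel
    flagged ones, dropping any it cannot relabel."""
    kept, flagged = linguistic_agent_review(detections, is_plausible)
    for det in flagged:
        new_label = reclassify(det)
        if new_label is not None:
            kept.append(Detection(new_label, det.box, det.score))
    return kept


# Toy stand-ins (assumptions): a real system would query an MLLM and a
# classifier here instead of using lookup tables.
IMPLAUSIBLE_PAIRS = {("surfboard", "oven")}


def toy_is_plausible(det: Detection, context: List[Detection]) -> bool:
    return not any((det.label, other.label) in IMPLAUSIBLE_PAIRS for other in context)


def toy_reclassify(det: Detection) -> Optional[str]:
    return {"surfboard": "cutting board"}.get(det.label)


detections = [
    Detection("oven", (0, 0, 50, 50), 0.9),
    Detection("surfboard", (60, 10, 90, 30), 0.4),
]
refined = refine(detections, toy_is_plausible, toy_reclassify)
print([d.label for d in refined])
```

Passing the plausibility check and the reclassifier in as callables keeps the loop independent of any particular MLLM or detector, which mirrors the agent separation the abstract describes without assuming how the paper realizes it.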

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training