ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Seungmin Han; Haeun Kwon; Ji-jun Park; Taeyang Yoon

arXiv:2508.15164 · cs.CL · August 22, 2025

Open Access

TL;DR

This paper introduces a new benchmark and a holistic framework for multi-turn, visually-grounded dialogue tasks, significantly improving reasoning, instruction following, and context retention in large vision-language models.

Contribution

The paper presents MMDR-Bench, a dataset of complex multi-turn dialogues, and CoLVLM Agent, a modular framework that enhances LVLMs with iterative reasoning and memory mechanisms without extensive retraining.

Findings

01  CoLVLM Agent outperforms GPT-4o and Gemini 1.5 Pro in human evaluations.

02  The framework improves reasoning depth and instruction adherence.

03  Performance remains robust over extended dialogue turns.

Abstract

Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems