ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon

TL;DR
This paper introduces a new benchmark and a holistic framework for multi-turn, visually-grounded dialogue tasks, significantly improving reasoning, instruction following, and context retention in large vision-language models.
Contribution
It presents MMDR-Bench, a comprehensive dataset for complex multi-turn dialogues, and CoLVLM Agent, a modular framework enhancing LVLMs with iterative reasoning and memory mechanisms without extensive retraining.
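As a rough illustration of what "iterative reasoning and memory mechanisms without extensive retraining" could look like, the sketch below wraps a frozen model in a per-turn refinement loop with a simple dialogue memory. It is not the authors' implementation: `DialogueMemory`, `answer_turn`, the prompt layout, and the stop-when-unchanged rule are all hypothetical, and the model is a plain callable standing in for an LVLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a frozen LVLM: takes a text prompt, returns text.
Model = Callable[[str], str]


@dataclass
class DialogueMemory:
    """Illustrative memory: stores past turns so later prompts can reuse them."""
    turns: List[str] = field(default_factory=list)

    def add(self, user_msg: str, agent_msg: str) -> None:
        self.turns.append(f"User: {user_msg}\nAgent: {agent_msg}")

    def as_context(self, last_n: int = 5) -> str:
        # Only the most recent turns are replayed, to keep prompts bounded.
        return "\n".join(self.turns[-last_n:])


def answer_turn(model: Model, memory: DialogueMemory, image_desc: str,
                user_msg: str, max_steps: int = 3) -> str:
    """Answer one turn by repeatedly asking the model to refine its draft."""
    context = memory.as_context()
    draft = ""
    for _ in range(max_steps):
        prompt = (
            f"Image: {image_desc}\n"
            f"History:\n{context}\n"
            f"Request: {user_msg}\n"
            f"Previous draft: {draft}"
        )
        new_draft = model(prompt)
        if new_draft == draft:  # assumed stopping rule: draft no longer changes
            break
        draft = new_draft
    memory.add(user_msg, draft)  # retain the finished turn for later context
    return draft
```

Because the model is only called, never updated, a wrapper of this kind changes behavior through prompting and stored context alone, which is the sense in which no retraining is needed here.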
Findings
CoLVLM Agent outperforms GPT-4o and Gemini 1.5 Pro in human evaluations.
The framework improves reasoning depth and instruction adherence.
Performance remains robust over extended dialogue turns.
Abstract
Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
