ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Yiming Lei; Zhizheng Yang; Zeming Liu; Haitao Leng; Shaoguo Liu; Tingting Gao; Qingjie Liu; Yunhong Wang

arXiv:2505.23121·cs.CL·July 18, 2025

Open Access

TL;DR

This paper introduces ContextQFormer, a context modeling module that uses a memory block to strengthen multi-turn multi-modal dialogue, together with TMDialog, a new dataset of longer conversations. ContextQFormer improves the available rate by 2%-4% over baselines.

Contribution

The paper proposes ContextQFormer with a memory-based context modeling approach and introduces TMDialog, a new dataset for multi-turn multi-modal dialogue research.

Findings

01 ContextQFormer improves the available rate by 2%-4% over the baselines.

02 TMDialog contains longer conversations than other multi-modal dialogue datasets, supporting multi-turn dialogue research.

03 Experimental results on TMDialog validate the effectiveness of ContextQFormer.

Abstract

Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models are weak at multi-turn interaction, especially over long contexts. To address this issue, we first introduce a context modeling module, termed ContextQFormer, which uses a memory block to enhance the representation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems