Multi-Turn Multi-Modal Question Clarification for Enhanced Conversational Understanding
Kimia Ramezan, Alireza Amiri Bavandpour, Yifei Yuan, Clemencia Siro, Mohammad Aliannejadi

TL;DR
This paper introduces a multi-turn multi-modal clarification framework that combines text and images to improve conversational search, demonstrating significant performance gains over single-turn and uni-modal methods.
Contribution
The paper defines the MMCQ task, builds the large-scale ClariMM dataset, and proposes Mario, a retrieval framework that refines queries through multi-modal, multi-turn interactions.
Findings
Multi-turn multi-modal clarification outperforms uni-modal approaches.
The proposed method improves MRR by 12.88%.
Performance gains are most notable in longer interactions.
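For readers unfamiliar with the metric, the following is a minimal sketch of how Mean Reciprocal Rank (MRR) is conventionally computed, and of how a percentage gain between two MRR scores might be derived. The function names and the sample ranks are hypothetical and are not taken from the paper. The summary above does not state whether the 12.88% figure is a relative or an absolute improvement, so the relative form shown here is only one possible reading.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Compute MRR from the 1-based rank of the first relevant result per query.

    A rank of None means no relevant result was retrieved for that query,
    which contributes a reciprocal rank of 0.
    """
    if not first_relevant_ranks:
        raise ValueError("at least one query is required")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline` (relative, not absolute)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical ranks for four queries (illustrative only, not from ClariMM).
example_ranks = [1, 2, None, 4]
example_mrr = mean_reciprocal_rank(example_ranks)  # (1 + 1/2 + 0 + 1/4) / 4
```

Because MRR only rewards the first relevant result, it suits settings where the system should surface one good item near the top of the ranking.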
Abstract
Conversational query clarification enables users to refine their search queries through interactive dialogue, improving search effectiveness. Traditional approaches rely on text-based clarifying questions, which often fail to capture complex user preferences, particularly those involving visual attributes. While recent work has explored single-turn multi-modal clarification with images alongside text, such methods do not fully support the progressive nature of user intent refinement over multiple turns. Motivated by this, we introduce the Multi-turn Multi-modal Clarifying Questions (MMCQ) task, which combines text and visual modalities to refine user queries in a multi-turn conversation. To facilitate this task, we create a large-scale dataset named ClariMM comprising over 13k multi-turn interactions and 33k question-answer pairs containing multi-modal clarifying questions. We propose…
Taxonomy
Topics: Speech and dialogue systems · Educational Technology and Assessment
