MuDoC: An Interactive Multimodal Document-grounded Conversational AI System

Karan Taneja; Ashok K. Goel

arXiv:2502.09843·cs.AI·February 17, 2025

Open Access

TL;DR

MuDoC is an interactive multimodal conversational AI system that leverages both textual and visual document content to generate grounded responses, enhancing trust and verification in human-AI interactions.

Contribution

This work introduces MuDoC, a novel multimodal AI system that directly incorporates document visuals alongside text for improved response generation.

Findings

01

MuDoC effectively integrates text and figures for grounded responses.

02

The system enhances user trust through source verification features.

03

Qualitative analysis reveals strengths and limitations of MuDoC.

Abstract

Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside their textual content for response generation. We present MuDoC, an interactive conversational AI agent based on GPT-4o that generates document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations of MuDoC responses that highlight its strengths and limitations.

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems