MuDoC: An Interactive Multimodal Document-grounded Conversational AI System
Karan Taneja, Ashok K. Goel

TL;DR
MuDoC is an interactive multimodal conversational AI system that leverages both textual and visual document content to generate grounded responses, enhancing trust and verification in human-AI interactions.
Contribution
This work introduces MuDoC, a novel multimodal AI system that directly incorporates document visuals alongside text for improved response generation.
Findings
MuDoC effectively integrates text and figures for grounded responses.
The system enhances user trust through source verification features.
Qualitative analysis reveals strengths and limitations of MuDoC.
Abstract
Multimodal AI is an important step towards building effective tools that leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system that can interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside their textual content for response generation. We present MuDoC, an interactive conversational AI agent based on GPT-4o that generates document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses, highlighting its strengths and limitations.
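To make the idea of an interleaved, source-linked response concrete, here is a minimal sketch of one way such a response could be represented. It is not MuDoC's implementation: the class names (`SourceRef`, `TextSegment`, `FigureSegment`), their fields, and the helpers `render` and `sources` are all hypothetical, chosen only to illustrate text and figure segments that each carry a pointer back to the source document.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SourceRef:
    """Pointer back into the source document (hypothetical fields)."""
    page: int
    element_id: str  # e.g. an identifier for a paragraph or a figure


@dataclass(frozen=True)
class TextSegment:
    """A span of response text grounded in one source location."""
    text: str
    source: SourceRef


@dataclass(frozen=True)
class FigureSegment:
    """A figure taken from the document and placed inline in the response."""
    caption: str
    source: SourceRef


Segment = Union[TextSegment, FigureSegment]


def render(segments: List[Segment]) -> str:
    """Render an interleaved response as plain text with source markers."""
    lines = []
    for seg in segments:
        ref = f"[p.{seg.source.page} #{seg.source.element_id}]"
        if isinstance(seg, FigureSegment):
            lines.append(f"(figure) {seg.caption} {ref}")
        else:
            lines.append(f"{seg.text} {ref}")
    return "\n".join(lines)


def sources(segments: List[Segment]) -> List[SourceRef]:
    """Collect source pointers in response order, without duplicates."""
    seen: List[SourceRef] = []
    for seg in segments:
        if seg.source not in seen:
            seen.append(seg.source)
    return seen
```

In an interface like the one the abstract describes, each marker produced by `render` would presumably become a link that navigates to the referenced passage or figure; here it is only a text tag.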
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
