SparrowVQE: Visual Question Explanation for Course Content Understanding
Jialu Li, Manish Kumar Thota, Ruslan Gokhman, Radek Holik, and Youshan Zhang

TL;DR
SparrowVQE is a small multimodal model that extends visual question answering from brief answers to detailed explanations of visual content in educational videos, and it is reported to outperform existing VQA methods.
Contribution
Introduces SparrowVQE, a small multimodal model trained in three stages, and MLVQE, a new dataset for visual question explanation in course content.
Findings
SparrowVQE outperforms state-of-the-art VQA methods on multiple datasets.
The model effectively connects visual and textual information for detailed explanations.
The three-stage training improves multimodal understanding and accuracy.
Abstract
Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which extends VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created the MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed SparrowVQE, a small multimodal model with 3 billion parameters. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide images and transcripts feature alignment), instruction tuning (tuning the…
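To give a rough sense of scale for the MLVQE figures quoted in the abstract, the short sketch below derives a few averages from them. Only the four counts come from the abstract; the helper name `per_unit` and the choice of ratios are illustrative assumptions, and the abstract does not say how QA pairs or transcript words are actually distributed across slides or weeks.

```python
# Counts as quoted in the abstract; everything derived below is a plain average.
SLIDES = 885
TRANSCRIPT_WORDS = 110_407
QA_PAIRS = 9_416
WEEKS = 14


def per_unit(total: int, units: int) -> float:
    """Average of `total` spread evenly over `units`, rounded to one decimal."""
    return round(total / units, 1)


# Hypothetical summary ratios (not reported in the source).
stats = {
    "qa_pairs_per_slide": per_unit(QA_PAIRS, SLIDES),
    "transcript_words_per_slide": per_unit(TRANSCRIPT_WORDS, SLIDES),
    "slides_per_week": per_unit(SLIDES, WEEKS),
    "qa_pairs_per_week": per_unit(QA_PAIRS, WEEKS),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

On these numbers each slide carries roughly ten QA pairs and a little over a hundred words of transcript on average, which suggests the questions are dense relative to the visual material.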
Taxonomy
Topics: Educational Assessment and Pedagogy · Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques
