Beyond the Textual: Generating Coherent Visual Options for MCQs

Wanqiang Wang; Longzhu He; Wei Zheng

arXiv:2508.18772·cs.CV·August 27, 2025

TL;DR

This paper introduces CmOS, a novel framework that leverages multimodal reasoning and retrieval-augmented generation to create high-quality visual options for multiple-choice questions, enhancing educational assessments.

Contribution

The paper presents a cross-modal synthesis framework that generates visual options for MCQs, addressing the cost of manual authoring and the limits of earlier text-only methods.

Findings

01 CmOS outperforms existing methods in content discrimination.

02 It generates semantically plausible and visually similar options.

03 The framework is effective across various subjects and educational levels.

Abstract

Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has focused primarily on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process with Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answers and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS…
