Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation

Yuan An

arXiv:2602.18891·cs.CY·February 24, 2026

Open Access

TL;DR

This pilot study demonstrates how human researchers can coordinate multiple LLM agents to generate and evaluate multiple-choice questions, revealing strengths and gaps in AI-assisted scientific research workflows.

Contribution

The paper introduces a novel AI-orchestrated research workflow for MCQ generation and evaluation, highlighting the shift in researcher roles and identifying persistent quality gaps in AI-generated content.

Findings

01

Generated MCQs scored high on overall quality but did not fully match the expert-written questions.

02

Surface-level qualities like grammar and clarity were consistently strong.

03

Gaps were found in skill depth, cognitive engagement, and metadata alignment.

Abstract

Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ…
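As a rough illustration of the evaluation step, the sketch below averages per-criterion scores across a set of MCQs and flags the weakest criteria. It is only a minimal sketch under stated assumptions: the criterion names, the 1-5 scale, the threshold, and the functions `criterion_means` and `weakest_criteria` are hypothetical and are not taken from the paper, whose 24 criteria are not listed on this page. The sample numbers are made up and are not study data.

```python
from statistics import mean


def criterion_means(evaluations):
    """Average each criterion's score across all evaluated MCQs."""
    scores_by_criterion = {}
    for scores in evaluations:
        for criterion, score in scores.items():
            scores_by_criterion.setdefault(criterion, []).append(score)
    return {c: mean(values) for c, values in scores_by_criterion.items()}


def weakest_criteria(means, threshold):
    """Return criteria whose mean is below the threshold, lowest first."""
    low = [(c, m) for c, m in means.items() if m < threshold]
    return [c for c, _ in sorted(low, key=lambda item: item[1])]


# Illustrative scores on an assumed 1-5 scale (hypothetical, not study data).
sample = [
    {"grammar": 5, "clarity": 5, "skill_depth": 3},
    {"grammar": 5, "clarity": 4, "skill_depth": 2},
]

means = criterion_means(sample)
gaps = weakest_criteria(means, threshold=4.0)
print(means)
print(gaps)
```

Keeping the scores as one dictionary per MCQ means the same two functions could be applied to original and generated questions separately, so the two sets can be compared criterion by criterion.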

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Computational and Text Analysis Methods