LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?

Zhuang Yu; Lei Shen; Jing Zhao; Shiliang Sun

arXiv:2601.20705 · cs.CV · January 29, 2026

TL;DR

LEMON introduces a comprehensive benchmark for evaluating multimodal large language models on long-form, educational STEM videos, emphasizing temporal reasoning, cross-modal integration, and pedagogical understanding.

Contribution

This paper presents LEMON, a new benchmark with diverse tasks and rich content to assess MLLMs' capabilities in understanding and reasoning over instructional videos.

Findings

01. State-of-the-art MLLMs perform poorly on temporal reasoning tasks.

02. LEMON reveals significant gaps in current models' understanding of instructional content.

03. The benchmark encourages the development of more advanced multimodal reasoning models.

Abstract

Recent multimodal large language models (MLLMs) have shown remarkable progress across vision, audio, and language tasks, yet their performance on long-form, knowledge-intensive, and temporally structured educational content remains largely unexplored. To bridge this gap, we introduce LEMON, a Lecture-based Evaluation benchmark for MultimOdal uNderstanding, focusing on STEM lecture videos that require long-horizon reasoning and cross-modal integration. LEMON comprises 2,277 video segments spanning 5 disciplines and 29 courses, with an average duration of 196.1 seconds, yielding 4,181 high-quality QA pairs, including 3,413 multiple-choice and 768 open-ended questions. Distinct from existing video benchmarks, LEMON features: (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning