ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation
Nabin Oli

TL;DR
ManiBench is a new benchmark designed to evaluate large language models' ability to generate accurate and visually correct Manim code, addressing issues like syntax errors and logical drift in dynamic visualizations.
Contribution
This paper introduces ManiBench, a specialized benchmark for testing LLMs on generating dynamic, pedagogical visualizations with a focus on temporal fidelity and API correctness.
Findings
Effective evaluation framework for visual-logic accuracy.
Identification of common failure modes in code generation.
Open-source tools and datasets for benchmarking.
Abstract
Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version-aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150-200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four-tier framework measuring Executability, Version-Conflict…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Testing and Debugging Techniques · Software Engineering Research · Logic, programming, and type systems
