Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak

TL;DR
This paper introduces a controlled multimodal multiplication benchmark to evaluate and analyze the arithmetic capabilities of large language models across text, image, and audio modalities, revealing computational limitations and reasoning tendencies.
Contribution
It presents a new multimodal multiplication benchmark with systematic variation and a mechanistic proxy for arithmetic load, enabling detailed analysis of model performance and reasoning procedures.
Findings
Accuracy drops sharply as arithmetic load increases, often nearing zero for C > 100.
Models achieve near-perfect accuracy (> 99%) on perception tasks across modalities, despite their arithmetic failures.
Models favor decomposition-style reasoning, but heuristic-specific adapters can degrade accuracy.
Abstract
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit counts, a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply…
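To make the definition of arithmetic load concrete, here is a minimal sketch of how C might be computed for one multiplication problem. The abstract does not say how the two operands are combined, so this sketch assumes that both the total and the non-zero digit counts are summed over the two operands before being multiplied. The function names `digit_counts` and `arithmetic_load` are hypothetical and are not taken from the paper's code.

```python
def digit_counts(n: int) -> tuple[int, int]:
    """Return (total digits, non-zero digits) of an integer's decimal form."""
    digits = str(abs(n))
    total = len(digits)
    nonzero = sum(ch != "0" for ch in digits)
    return total, nonzero


def arithmetic_load(a: int, b: int) -> int:
    """Arithmetic load C = (total digit count) * (non-zero digit count).

    ASSUMPTION: counts are pooled over both operands; the paper may
    combine them differently.
    """
    total_a, nonzero_a = digit_counts(a)
    total_b, nonzero_b = digit_counts(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


if __name__ == "__main__":
    # Sparse operands: 7 digits in total, 4 of them non-zero.
    print(arithmetic_load(4007, 120))
    # Dense six-digit operands: 12 digits in total, 11 of them non-zero.
    print(arithmetic_load(123456, 789012))
```

Under this assumed pooling, a sparse problem such as 4007 × 120 has a low load, while a dense pair of six-digit operands exceeds the C > 100 range in which accuracy is reported to approach zero.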
