Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Samuel G. Balter; Ethan Jerzak; Connor T. Jerzak

arXiv:2604.18203 · cs.CL · April 21, 2026

TL;DR

This paper introduces a controlled multimodal multiplication benchmark to evaluate and analyze the arithmetic capabilities of large language models across text, image, and audio modalities, revealing computational limitations and reasoning tendencies.

Contribution

It presents a new multimodal multiplication benchmark with systematic variation and a mechanistic proxy for arithmetic load, enabling detailed analysis of model performance and reasoning procedures.
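
As a rough illustration of what paired, systematically varied instances could look like, the sketch below samples operands with a chosen digit length and number of non-zero digits from a seeded generator, then renders the same problem as numerals and as number words. The helper names (`sample_operand`, `to_words`, `paired_prompts`) and the prompt wording are hypothetical and are not taken from the paper or its repository; image and audio rendering are omitted.

```python
import random

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def sample_operand(rng: random.Random, n_digits: int, n_nonzero: int) -> int:
    """Sample an integer with exactly n_digits digits, n_nonzero of them non-zero."""
    if not 1 <= n_nonzero <= n_digits:
        raise ValueError("need 1 <= n_nonzero <= n_digits")
    digits = ["0"] * n_digits
    # The leading digit must be non-zero; the rest are placed at random positions.
    positions = [0] + rng.sample(range(1, n_digits), n_nonzero - 1)
    for pos in positions:
        digits[pos] = str(rng.randint(1, 9))
    return int("".join(digits))


def _below_thousand(n: int) -> str:
    """Spell out 1..999 in English words."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)


def to_words(n: int) -> str:
    """Spell out 0..999,999 in English words (range limit is a sketch simplification)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999,999")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(_below_thousand(thousands) + " thousand")
    if rest:
        parts.append(_below_thousand(rest))
    return " ".join(parts)


def paired_prompts(a: int, b: int) -> dict[str, str]:
    """Render one underlying problem in two text representations (hypothetical wording)."""
    return {
        "numerals": f"What is {a} × {b}?",
        "words": f"What is {to_words(a)} times {to_words(b)}?",
    }


rng = random.Random(0)  # fixed seed so the same instances can be regenerated
a = sample_operand(rng, n_digits=5, n_nonzero=2)
b = sample_operand(rng, n_digits=3, n_nonzero=3)
print(paired_prompts(a, b))
```

Because both renderings come from the same integers, any accuracy gap between them can be attributed to the representation rather than to the underlying arithmetic problem.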

Findings

01

Accuracy drops sharply as arithmetic load C increases, often approaching zero once C exceeds 100.

02

Models achieve near-perfect accuracy (> 99%) on perception tasks across modalities, even where their arithmetic fails.

03

Models tend to favor decomposition-style reasoning, but heuristic-specific adapters can degrade accuracy.

Abstract

Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total digit count and the non-zero digit count, a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply…
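
To make the arithmetic load definition concrete, here is a minimal sketch that computes C for a pair of operands. The abstract does not say how digit counts are aggregated across the two operands, so pooling them (summing the counts over both operands before multiplying) is an assumption made here for illustration; the function names are likewise hypothetical.

```python
def digit_counts(n: int) -> tuple[int, int]:
    """Return (total digits, non-zero digits) of an integer, ignoring its sign."""
    digits = str(abs(n))
    total = len(digits)
    nonzero = sum(ch != "0" for ch in digits)
    return total, nonzero


def arithmetic_load(a: int, b: int) -> int:
    """C = (total digit count) * (non-zero digit count).

    Assumption: counts are pooled over both operands. The paper may
    aggregate them differently.
    """
    total_a, nonzero_a = digit_counts(a)
    total_b, nonzero_b = digit_counts(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


print(arithmetic_load(12, 34))          # 4 digits * 4 non-zero = 16
print(arithmetic_load(1000, 2000))      # 8 digits * 2 non-zero = 16
print(arithmetic_load(123456, 789012))  # 12 digits * 11 non-zero = 132
```

Under this reading, sparse operands such as 1000 and 2000 carry the same load as a dense two-digit pair, while a dense six-digit pair already exceeds C = 100, the region where the findings report accuracy near zero.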

Code & Models

Repositories

cjerzak/llm-multimodal-math
github

Datasets

cjerzak/MultimodalMathBenchmarks
dataset· 3.5k dl