SO-Bench: A Structural Output Evaluation of Multimodal LLMs

Di Feng; Kaixin Ma; Feng Nan; Haofeng Chen; Bohan Zhai; David Griffiths; Mingfei Gao; Zhe Gan; Eshan Verma; Yinfei Yang; Zhifeng Chen; Afshin Dehghan

arXiv:2511.21750 · cs.CV · March 19, 2026

TL;DR

This paper introduces SO-Bench, a comprehensive benchmark for evaluating the ability of multimodal large language models to generate schema-compliant, structured outputs from visual inputs across diverse domains, revealing significant gaps and potential for improvement.

Contribution

The paper presents SO-Bench, the first systematic benchmark for schema-grounded visual output evaluation in multimodal LLMs, along with training strategies to enhance structured output capabilities.

Findings

01 Models show significant gaps in schema accuracy and compliance.

02 Benchmark reveals persistent challenges in multimodal structured reasoning.

03 Training methods can substantially improve structured output performance.

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of the visual structural output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains (UI screens, natural images, documents, and charts), SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting…
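To make "schema-compliant" concrete, here is a minimal sketch of a compliance check on a model's raw text output. It is not the paper's evaluation code or metric. The names `CHART_SCHEMA`, `violations`, and `check_output` are hypothetical, the example schema is invented for illustration, and the validator covers only a small subset of JSON Schema keywords (`type`, `enum`, `required`, `properties`, `items`).

```python
import json

# Map JSON Schema type names to Python types (small subset, for illustration).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}

# Hypothetical schema for a chart image; not taken from SO-Bench.
CHART_SCHEMA = {
    "type": "object",
    "required": ["title", "chart_type", "series"],
    "properties": {
        "title": {"type": "string"},
        "chart_type": {"type": "string", "enum": ["bar", "line", "pie"]},
        "series": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["label", "value"],
                "properties": {
                    "label": {"type": "string"},
                    "value": {"type": "number"},
                },
            },
        },
    },
}


def violations(value, schema, path="$"):
    """Return a list of schema violations; an empty list means compliant."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        if not isinstance(value, _TYPES[expected]) or bool_as_number:
            return [f"{path}: expected {expected}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: value not in enum")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(violations(value[key], sub, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(violations(item, schema["items"], f"{path}[{i}]"))
    return errors


def check_output(raw_text, schema):
    """Parse a model's raw text as JSON, then check it against the schema."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return ["$: output is not valid JSON"]
    return violations(parsed, schema)
```

Calling `check_output(model_text, CHART_SCHEMA)` returns an empty list for a compliant output and a list of path-tagged messages otherwise. Compliance is only half of what the abstract asks for: an output can satisfy the schema while its values are wrong for the image, so accuracy would need a separate comparison against a reference answer.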

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Data Visualization and Analytics