StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou

TL;DR
StructTest introduces a new benchmark for evaluating large language models' reasoning abilities through their capacity to generate structured, compositional outputs across multiple domains, offering an unbiased and scalable assessment method.
Contribution
The paper presents a rule-based evaluation framework that measures LLM reasoning by testing structured output generation, which reduces evaluator bias and the risk of cheating.
Findings
StructTest remains challenging for top models like GPT-4o.
It provides a scalable, unbiased evaluation across diverse domains.
The benchmark is easily extendable to new tasks and datasets.
Abstract
The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for…
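To make the idea of a deterministic, rule-based evaluator concrete, here is a minimal sketch in Python. It is not the StructTest implementation: the function names `check_bullet_summary` and `pass_rate`, the bullet-list format rule, and the word limit are all hypothetical, chosen only to illustrate how a compositional format instruction ("write exactly N bullet points, each at most K words") can be verified without a reference answer or a judge model.

```python
import re


def check_bullet_summary(output: str, num_points: int, max_words: int) -> bool:
    """Hypothetical rule: the output must be exactly `num_points` non-empty
    lines, each starting with '- ' and containing at most `max_words` words."""
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    if len(lines) != num_points:
        return False
    for ln in lines:
        # Each line must be a dash bullet followed by non-empty text.
        match = re.fullmatch(r"- (\S.*)", ln)
        if match is None:
            return False
        if len(match.group(1).split()) > max_words:
            return False
    return True


def pass_rate(outputs: list[str], num_points: int, max_words: int) -> float:
    """Fraction of model outputs that satisfy the format rule."""
    if not outputs:
        return 0.0
    passed = sum(check_bullet_summary(o, num_points, max_words) for o in outputs)
    return passed / len(outputs)
```

Because the check depends only on the output's structure, the same rule can be applied to any input document, which is one way to read the abstract's claim that the evaluator extends easily to new tasks and datasets.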
Taxonomy
Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property
