StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

Hailin Chen; Fangkai Jiao; Mathieu Ravaut; Nawshad Farruque; Xuan Phi Nguyen; Chengwei Qin; Manan Dey; Bosheng Ding; Caiming Xiong; Shafiq Joty; Yingbo Zhou

arXiv:2412.18011 · cs.CL · March 21, 2025

Open Access

TL;DR

StructTest introduces a new benchmark for evaluating large language models' reasoning abilities through their capacity to generate structured, compositional outputs across multiple domains, offering an unbiased and scalable assessment method.

Contribution

It presents a novel, rule-based evaluation framework that effectively measures LLM reasoning by testing structured output generation, reducing biases and cheating risks.

Findings

1. StructTest remains challenging for top models like GPT-4o.

2. It provides a scalable, unbiased evaluation across diverse domains.

3. The benchmark is easily extendable to new tasks and datasets.

Abstract

The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for…
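To make the idea of a deterministic, rule-based evaluator concrete, here is a minimal sketch in Python. It is not the StructTest evaluator: the function name `check_numbered_summary`, the numbered-line output format, and the per-line word limit are all hypothetical, chosen only to illustrate how a compositional formatting instruction can be checked by rules alone, with no reference answer and no judge model.

```python
import re


def check_numbered_summary(output: str, n_points: int, max_words: int) -> bool:
    """Hypothetical rule-based check (an illustration, not the paper's code).

    Passes only if `output` consists of exactly `n_points` non-empty lines,
    numbered consecutively as '1. ...', '2. ...', and each line's text has
    at most `max_words` whitespace-separated words.
    """
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]

    # Rule 1: the number of points must match the instruction exactly.
    if len(lines) != n_points:
        return False

    for expected_index, line in enumerate(lines, start=1):
        match = re.fullmatch(r"(\d+)\.\s+(.+)", line)

        # Rule 2: every line must carry the correct consecutive number.
        if match is None or int(match.group(1)) != expected_index:
            return False

        # Rule 3: every point must respect the length limit.
        if len(match.group(2).split()) > max_words:
            return False

    return True


if __name__ == "__main__":
    good = "1. Models are hard to evaluate.\n2. Rules make scoring deterministic."
    bad = "1. Models are hard to evaluate.\n3. Numbering skips a step."
    print(check_numbered_summary(good, n_points=2, max_words=8))
    print(check_numbered_summary(bad, n_points=2, max_words=8))
```

Because every rule is a pure function of the model's output, the same response always receives the same verdict, and a new format constraint can be added as one more rule without collecting target answers.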

Taxonomy

Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property