Structured Prompts Improve Evaluation of Language Models

Asad Aali; Muhammad Ahmed Mohsin; Vasiliki Bikia; Arnav Singhvi; Richard Gaus; Suhana Bedi; Hejie Cui; Miguel Fuentes; Alyssa Unell; Yifan Mai; Jordan Cahoon; Michael Pfeffer; Roxana Daneshjou; Sanmi Koyejo; Emily Alsentzer; Christopher Potts; Nigam H. Shah; Akshay S. Chaudhari

arXiv:2511.20836·cs.CL·April 2, 2026

TL;DR

Structured prompts significantly influence language model evaluation outcomes, with the new DSPy+HELM framework enabling systematic analysis of prompt effects on benchmark scores.

Contribution

This work introduces a reproducible framework that combines DSPy and HELM to study how prompt choice affects language model benchmarking, and shows that the effect on reported scores is substantial.

Findings

01. Prompt choice can materially impact leaderboard rankings.

02. Structured prompting improves performance by 6% on average.

03. Most gains come from chain-of-thought prompting, with limited benefit from advanced optimizers.

Abstract

As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks are essential for guiding deployment decisions. In practice, however, frameworks such as Holistic Evaluation of Language Models (HELM) typically evaluate models under a single static prompt configuration, even though model behavior depends strongly on prompt choice. As a result, reported scores can reflect prompt choice as much as model capability. Declarative prompting frameworks such as DSPy offer a scalable way to evaluate models under a set of structured prompting strategies rather than a static prompt configuration. We present a reproducible DSPy+HELM framework for studying how prompt choice impacts reported benchmark outcomes. Using five prompting methods, we evaluate four frontier and two open-source LMs across seven benchmarks against existing HELM baseline scores. By…
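The point that reported scores can reflect prompt choice as much as model capability can be illustrated with a small sketch. This is not the paper's DSPy+HELM pipeline: the model names, prompting-method labels, and scores below are all hypothetical, made-up values, and scoring each model under its best prompting method is only one assumed way of comparing against a single fixed prompt.

```python
# Hypothetical accuracy scores (made-up numbers, NOT results from the paper).
# Layout: model name -> prompting-method label -> score on one benchmark.
SCORES = {
    "model_a": {"baseline": 0.70, "zero_shot_cot": 0.74, "optimized": 0.75},
    "model_b": {"baseline": 0.66, "zero_shot_cot": 0.78, "optimized": 0.79},
    "model_c": {"baseline": 0.60, "zero_shot_cot": 0.63, "optimized": 0.64},
}


def rank_models(scores, pick):
    """Rank models (best first) by the score that `pick` selects per model."""
    chosen = {model: pick(by_prompt) for model, by_prompt in scores.items()}
    return sorted(chosen, key=chosen.get, reverse=True)


def baseline_ranking(scores):
    """Ranking under the single fixed ("baseline") prompt configuration."""
    return rank_models(scores, lambda by_prompt: by_prompt["baseline"])


def best_prompt_ranking(scores):
    """Ranking when each model is scored under its best prompting method."""
    return rank_models(scores, lambda by_prompt: max(by_prompt.values()))


def mean_gain(scores):
    """Average improvement of the best prompt over the baseline prompt."""
    gains = [max(p.values()) - p["baseline"] for p in scores.values()]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    print("baseline ranking:   ", baseline_ranking(SCORES))
    print("best-prompt ranking:", best_prompt_ranking(SCORES))
    print("mean gain:          ", round(mean_gain(SCORES), 3))
```

With these invented numbers, the top two models swap places once each model is evaluated under its strongest prompt, which is the kind of ranking flip a single static prompt configuration would hide.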
