ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing
Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, Elena Glassman

TL;DR
ChainForge is an open-source visual toolkit that simplifies prompt engineering and hypothesis testing for large language models, enabling users to compare responses, design prompts, and evaluate models without programming expertise.
Contribution
It introduces a graphical interface for prompt engineering and hypothesis testing, supporting diverse user needs and facilitating exploration, evaluation, and refinement of LLM outputs.
Findings
In in-lab and interview studies, a range of people could use ChainForge to investigate hypotheses that matter to them.
The toolkit supports model selection, prompt design, and hypothesis testing.
Users from various backgrounds found it accessible and useful.
Abstract
Evaluating outputs of large language models (LLMs) is challenging, requiring making -- and making sense of -- many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in…
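The comparison described in the abstract, responses laid out across models and prompt variations, can be pictured as a grid: every filled-in prompt template is sent to every model. The following is a minimal sketch of that idea, not ChainForge's actual implementation or API. The helper names `expand_prompts` and `run_grid` are hypothetical, and the two "models" are stand-in functions so the example runs without any LLM provider.

```python
from itertools import product
from string import Template


def expand_prompts(template, variables):
    """Fill a template with every combination of the variable values."""
    names = list(variables)
    prompts = []
    for values in product(*(variables[name] for name in names)):
        bindings = dict(zip(names, values))
        prompts.append((bindings, Template(template).substitute(bindings)))
    return prompts


def run_grid(template, variables, models):
    """Send every prompt variation to every model; return one row per response."""
    rows = []
    for bindings, prompt in expand_prompts(template, variables):
        for model_name, query in models.items():
            rows.append({
                "model": model_name,
                "vars": bindings,
                "prompt": prompt,
                "response": query(prompt),
            })
    return rows


# Stand-in "models" (assumption): plain functions instead of real LLM calls.
models = {
    "echo-upper": lambda prompt: prompt.upper(),
    "echo-length": lambda prompt: str(len(prompt)),
}

rows = run_grid(
    "Translate '$word' into $language.",
    {"word": ["cat", "dog"], "language": ["French", "German"]},
    models,
)

for row in rows:
    print(row["model"], "|", row["prompt"], "|", row["response"])
```

Keeping the variable bindings alongside each response is what makes later comparison possible: rows can be grouped by model for model selection, or by variable value for prompt template design.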
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software System Performance and Reliability
Methods: Focus
