A Comparative Study of DSPy Teleprompter Algorithms for Aligning Large Language Models Evaluation Metrics to Human Evaluation

Bhaskarjit Sarmah; Kriti Dutta; Anna Grigoryan; Sachin Tiwari; Stefano Pasquali; Dhagash Mehta

arXiv:2412.15298·cs.CL·December 23, 2024

Open Access

TL;DR

This paper compares five DSPy teleprompter algorithms for aligning large language model evaluation metrics with human judgments and shows that optimized prompts can outperform benchmark methods at hallucination detection.

Contribution

It provides a comparative analysis of five teleprompter algorithms within DSPy for aligning LLM evaluations to human annotations, highlighting their relative effectiveness.

Findings

01

Optimized prompts outperform benchmark methods in hallucination detection.

02

Certain teleprompter algorithms outperform others in specific experiments.

03

Prompt optimization improves alignment with human evaluations.

Abstract

We argue that the Declarative Self-improving Python (DSPy) optimizers are a way to align large language model (LLM) prompts and their evaluations with human annotations. We present a comparative analysis of five teleprompter algorithms, namely Cooperative Prompt Optimization (COPRO), Multi-Stage Instruction Prompt Optimization (MIPRO), BootstrapFewShot, BootstrapFewShot with Optuna, and K-Nearest Neighbor Few Shot, within the DSPy framework with respect to their ability to align with human evaluations. As a concrete example, we focus on optimizing the prompt to align hallucination detection (using an LLM as a judge) with human-annotated ground-truth labels for a publicly available benchmark dataset. Our experiments demonstrate that optimized prompts can outperform various benchmark methods for detecting hallucination, and certain teleprompters outperform the others in at least these…
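To make "aligning with human annotations" concrete, the following is a minimal sketch of the kind of agreement metric a prompt optimizer could maximize. It is an illustration under stated assumptions, not the paper's method: the excerpt above does not say which metric the authors used, the field name `is_hallucinated` and the toy labels are hypothetical, and the code runs without DSPy. The function follows the `(example, pred, trace=None)` calling convention that DSPy optimizers expect of a metric.

```python
from types import SimpleNamespace


def agreement_metric(example, pred, trace=None):
    """Return True when the judge's label matches the human label.

    Uses the (example, pred, trace) convention of DSPy metrics.
    The field name `is_hallucinated` is hypothetical.
    """
    return example.is_hallucinated == pred.is_hallucinated


def agreement_rate(examples, preds):
    """Fraction of items on which the judge agrees with the human labels."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(agreement_metric(e, p) for e, p in zip(examples, preds))
    return hits / len(examples)


# Toy data (made up): human annotations vs. a hypothetical LLM judge.
human = [SimpleNamespace(is_hallucinated=v) for v in (True, False, True, False)]
judge = [SimpleNamespace(is_hallucinated=v) for v in (True, False, False, False)]

score = agreement_rate(human, judge)
print(f"judge/human agreement: {score:.2f}")
```

In a DSPy setup, a function shaped like `agreement_metric` would be passed as the `metric` argument of an optimizer such as `BootstrapFewShot`, which then searches for prompts or demonstrations that raise the score on a training set.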

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems

Methods: Focus · ALIGN