Three Ways of Using Large Language Models to Evaluate Chat

Ondřej Plátek; Vojtěch Hudeček; Patricia Schmidtová; Mateusz Lango; Ondřej Dušek

arXiv:2308.06502·cs.CL·August 15, 2023·2 cites

TL;DR

This paper explores three ways of using large language models to evaluate chatbot responses at the turn level. It reports an improvement over the baseline when ChatGPT is prompted with dynamic few-shot examples, and it analyzes how Llama 2 models perform on the same competition task.

Contribution

Introduces three novel approaches leveraging LLMs for turn-level chatbot response evaluation, including dynamic few-shot prompting and analysis of open-source models.

Findings

01  Dynamic few-shot prompts improve ChatGPT evaluation accuracy.

02  Llama 2 models are closing the performance gap with ChatGPT.

03  Llama 2 models do not benefit from few-shot examples as much as ChatGPT.

Abstract

This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.
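The idea of dynamic few-shot prompting mentioned in the abstract can be illustrated with a minimal sketch: for each response to be rated, the most similar annotated examples are retrieved from a store and placed in the prompt. Everything concrete below is an assumption made for illustration and is not taken from the paper: the bag-of-words `embed` function stands in for a real sentence encoder, `ToyVectorStore` stands in for a real vector store, and the prompt wording, the 1 to 5 scale, and the example data are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store of annotated (context, response, score) examples."""

    def __init__(self):
        self.items = []

    def add(self, context, response, score):
        self.items.append((embed(context + " " + response), context, response, score))

    def nearest(self, context, response, k=2):
        """Return the k stored examples most similar to the query turn."""
        query = embed(context + " " + response)
        ranked = sorted(self.items, key=lambda it: cosine(query, it[0]), reverse=True)
        return [(c, r, s) for _, c, r, s in ranked[:k]]


def build_prompt(store, context, response, k=2):
    """Assemble a few-shot prompt whose examples depend on the turn being rated."""
    # Hypothetical instruction; the paper's actual prompt text is not given here.
    lines = ["Rate the appropriateness of the response on a scale from 1 to 5."]
    for c, r, s in store.nearest(context, response, k):
        lines += [f"Context: {c}", f"Response: {r}", f"Score: {s}", ""]
    lines += [f"Context: {context}", f"Response: {response}", "Score:"]
    return "\n".join(lines)


# Made-up annotated examples, for illustration only.
store = ToyVectorStore()
store.add("What is the weather like today?", "It is sunny and warm.", 5)
store.add("Can you recommend a pizza place?", "I like turtles.", 1)
store.add("Is it going to rain today?", "No, the weather should stay dry.", 5)

prompt = build_prompt(
    store,
    "How is the weather tomorrow?",
    "I think the weather will be sunny.",
    k=2,
)
print(prompt)
```

The resulting string would then be sent to an LLM, and the text generated after the final `Score:` would be parsed as the predicted rating. Because the examples are selected per query, two different turns generally receive two different prompts, which is what distinguishes this setup from a fixed few-shot prompt.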

Taxonomy

Topics: Topic Modeling · AI in Service Interactions · Machine Learning in Healthcare