Three Ways of Using Large Language Models to Evaluate Chat
Ondřej Plátek, Vojtěch Hudeček, Patricia Schmidtová, Mateusz Lango, Ondřej Dušek

TL;DR
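This paper presents three LLM-based approaches to predicting turn-level qualities of chatbot responses, submitted to the DSTC 11 Track 4 (ChatEval) competition. It reports improvement over the baseline when ChatGPT prompts include dynamic few-shot examples, and a post-deadline ablation of Llama 2 models on the same task.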
Contribution
Introduces three novel approaches leveraging LLMs for turn-level chatbot response evaluation, including dynamic few-shot prompting and analysis of open-source models.
Findings
Dynamic few-shot prompts improve ChatGPT evaluation accuracy.
Llama 2 models are closing the performance gap with ChatGPT.
Llama 2 models do not benefit from few-shot examples as much as ChatGPT.
Abstract
This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition. We present three different approaches to predicting turn-level qualities of chatbot responses based on large language models (LLMs). We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT. We also analyze the performance of the other two approaches and report needed improvements for future work. We developed the three systems over just two weeks, showing the potential of LLMs for this task. An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs. However, we find that the Llama 2 models do not benefit from few-shot examples in the same way as ChatGPT.
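The dynamic few-shot idea mentioned in the abstract can be illustrated with a minimal sketch: for each response to be rated, the most similar annotated examples are retrieved and placed in the prompt. Everything below is an assumption made for illustration and is not the authors' implementation: the example data, the 1 to 5 scale, the prompt wording, and the function names `embed`, `select_examples`, and `build_prompt` are hypothetical, and a toy bag-of-words similarity stands in for the vector store.

```python
import math
import re
from collections import Counter

# Hypothetical annotated examples: (context, response, human score on an assumed 1-5 scale).
EXAMPLES = [
    ("How was your weekend?", "It was great, I went hiking.", 5),
    ("How was your weekend?", "Bananas are yellow.", 1),
    ("Can you recommend a movie?", "Sure, try a classic comedy.", 4),
    ("What is the weather like?", "I do not know.", 2),
]


def embed(text):
    """Toy bag-of-words vector; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(context, response, k=2):
    """Return the k stored examples most similar to the turn being rated."""
    query = embed(context + " " + response)
    ranked = sorted(
        EXAMPLES,
        key=lambda ex: cosine(query, embed(ex[0] + " " + ex[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(context, response, k=2):
    """Assemble a few-shot prompt whose examples change with the input."""
    lines = ["Rate the response to the context on a scale from 1 to 5.", ""]
    for ctx, resp, score in select_examples(context, response, k):
        lines += [f"Context: {ctx}", f"Response: {resp}", f"Score: {score}", ""]
    lines += [f"Context: {context}", f"Response: {response}", "Score:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("How was your weekend?", "It was nice, I went swimming."))
```

The resulting string would then be sent to the LLM of choice, and the number it returns after the final "Score:" would be parsed as the prediction. Swapping `embed` and `cosine` for an embedding model and a vector index changes the retrieval quality but not the overall structure.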
Code & Models
- 🤗 davidkim205/komt-llama2-7b-v1 (model) · 14 dl · ♡ 3
- 🤗 davidkim205/komt-llama2-7b-v1-lora (model) · 8 dl · ♡ 2
- 🤗 davidkim205/komt-llama2-7b-v1-ggml (model) · 194 dl · ♡ 8
- 🤗 davidkim205/komt-llama2-13b-v1 (model) · 5 dl · ♡ 5
- 🤗 davidkim205/komt-llama2-13b-v1-lora (model) · 5 dl · ♡ 3
- 🤗 davidkim205/komt-llama2-13b-v1-ggml (model) · 137 dl · ♡ 6
- 🤗 davidkim205/komt-llama-30b-v1 (model) · 23 dl · ♡ 1
- 🤗 davidkim205/komt-llama-30b-v1-lora (model) · 5 dl
- 🤗 davidkim205/komt-mistral-7b-v1 (model) · 125 dl · ♡ 32
- 🤗 davidkim205/komt-mistral-7b-v1-lora (model) · 7 dl · ♡ 3
Taxonomy
Topics: Topic Modeling · AI in Service Interactions · Machine Learning in Healthcare
