A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4
Ming Gu, Yan Yang

TL;DR
This paper introduces a two-dimensional zero-shot evaluation method for dialogue state tracking (DST) using GPT-4. The method scores predictions along two dimensions, accuracy and completeness, and the authors report better performance than baseline evaluation methods while remaining consistent with exact matching.
Contribution
It proposes a GPT-4-based zero-shot evaluation framework with two manually designed reasoning paths in the prompt, extending DST assessment beyond exact matching.
Findings
Outperforms baseline evaluation methods
Remains consistent with traditional exact-matching evaluation
Demonstrates effectiveness of manual reasoning prompts
Abstract
Dialogue state tracking (DST) is typically evaluated with exact-matching methods, which rely on large amounts of labeled data and ignore semantic consistency, leading to over-evaluation. Recently, large language models (LLMs) have achieved promising results in evaluating natural language processing tasks. However, using LLMs for DST evaluation is still underexplored. In this paper, we propose a two-dimensional zero-shot evaluation method for DST using GPT-4, which divides the evaluation into two dimensions: accuracy and completeness. We also design two manual reasoning paths in the prompt to further improve the accuracy of the evaluation. Experimental results show that our method achieves better performance than the baselines and is consistent with traditional exact-matching-based methods.
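The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the reading of "accuracy" and "completeness", and the names `exact_match` and `build_prompt` are all assumptions made for illustration, and the sketch only builds prompt text without calling any model.

```python
from typing import Dict

# A dialogue state maps slot names to values, e.g. {"hotel-area": "centre"}.
DialogueState = Dict[str, str]


def exact_match(predicted: DialogueState, reference: DialogueState) -> bool:
    """Exact matching: all slot-value pairs must be string-identical."""
    return predicted == reference


# Hypothetical question wording; the paper's actual prompts are not shown here.
QUESTIONS = {
    "accuracy": (
        "Is every slot-value pair in the predicted state "
        "supported by the dialogue?"
    ),
    "completeness": (
        "Does the predicted state contain every slot-value pair "
        "that the user expressed in the dialogue?"
    ),
}


def build_prompt(dimension: str, dialogue: str, predicted: DialogueState) -> str:
    """Assemble a zero-shot prompt for one evaluation dimension.

    No reference labels are included, which is one way to read "zero-shot".
    """
    if dimension not in QUESTIONS:
        raise ValueError(f"unknown dimension: {dimension!r}")
    state_text = "; ".join(
        f"{slot} = {value}" for slot, value in sorted(predicted.items())
    )
    return (
        f"Dialogue:\n{dialogue}\n\n"
        f"Predicted state: {state_text}\n\n"
        f"{QUESTIONS[dimension]} Answer yes or no."
    )


if __name__ == "__main__":
    dialogue = "User: I need a hotel in the center of town."
    predicted = {"hotel-area": "center"}
    reference = {"hotel-area": "centre"}
    # Exact matching rejects a prediction that differs only in spelling.
    print(exact_match(predicted, reference))
    for dim in ("accuracy", "completeness"):
        print(build_prompt(dim, dialogue, predicted))
```

In this sketch, exact matching rejects `center` against the reference `centre` even though the two values mean the same thing, which is the kind of semantic consistency a string comparison cannot see. Each prompt would be sent to the model separately, giving one judgment per dimension.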
Taxonomy
Topics: Robotics and Automated Systems · Context-Aware Activity Recognition Systems · IoT-based Smart Home Systems
Methods: Dynamic Sparse Training · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention
