DialogSum Challenge: Results of the Dialogue Summarization Shared Task
Yulong Chen, Naihao Deng, Yang Liu, Yue Zhang

TL;DR
The DialogSum Challenge, a shared task at INLG 2022, compared systems for summarizing real-life scenario dialogues. Submissions improved substantially over the baseline models on automatic metrics such as Rouge, but human evaluation still found a salient gap between model-generated and human-annotated summaries, underscoring the difficulty of the task.
Contribution
This paper presents the results of a shared task on dialogue summarization, compares the approaches explored by the participating teams, and argues that more fine-grained evaluation metrics are needed.
Findings
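Submitted systems improve substantially over the baseline models on automatic metrics such as Rouge
Human evaluation across multiple aspects shows a salient gap between model-generated and human-annotated summaries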
Dialogue summarization remains a challenging task
Abstract
We report the results of the DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participate in this shared task and three submit their system reports, exploring different methods to improve the performance of dialogue summarization. Although the submitted systems improve greatly over the baseline models on automatic evaluation metrics such as Rouge scores, human evaluation across multiple aspects reveals a salient gap between model-generated outputs and human-annotated summaries. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are needed.
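To illustrate why overlap-based metrics can miss errors that a human reader would catch, here is a minimal sketch of a unigram-overlap F1 score in the spirit of Rouge-1. It is not the scoring code used in the challenge: the function name `rouge1_f1`, the whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two summaries (simplified, Rouge-1 style).

    Assumption: lowercase + whitespace tokenization, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical summaries, not taken from the DialogSum data.
reference = "tom invites amy to lunch"
paraphrase = "tom asks amy to lunch"      # faithful, one word differs
role_swap = "amy invites tom to lunch"    # wrong speaker roles, same words

paraphrase_score = rouge1_f1(paraphrase, reference)
role_swap_score = rouge1_f1(role_swap, reference)
```

In this toy case the summary with swapped speaker roles shares every unigram with the reference and so scores at least as high as the faithful paraphrase, which is the kind of error that unigram overlap alone cannot detect.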
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
