Discovering Elementary Discourse Units in Textual Data Using Canonical Correlation Analysis
Akanksha Mehndiratta, Krishna Asawa

TL;DR
This paper introduces an unsupervised, linear, and language-independent model using Canonical Correlation Analysis to identify Elementary Discourse Units in text, demonstrating competitive performance in textual similarity tasks.
Contribution
It proposes a novel unsupervised EDU segmentation method based on CCA, with a strong theoretical foundation and practical effectiveness in content selection tasks.
Findings
EDUs deliver competitive results in textual similarity tasks
The model outperforms some supervised techniques despite its simplicity
The approach is adaptable and language independent
Abstract
Canonical Correlation Analysis (CCA) has been used extensively for learning latent representations in various fields. This study takes a step further by demonstrating the potential of CCA in identifying Elementary Discourse Units (EDUs) that capture the latent information within textual data. The probabilistic interpretation of CCA discussed in this study utilizes the two-view nature of textual data, i.e., the consecutive sentences in a document or turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model for EDU segmentation that discovers EDUs in textual data without any supervision. To validate the model, the EDUs are used as the textual unit for content selection in a textual similarity task. Empirical results on the Semantic Textual Similarity (STSB) and Mohler datasets confirm that, despite…
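For readers unfamiliar with CCA, the two-view idea mentioned in the abstract can be illustrated with a minimal sketch of classical linear CCA. This is not the authors' EDU segmentation model: the function name `cca`, the regularization term `reg`, and the assumption that each row pairs a feature vector for one sentence (view X) with a feature vector for the following sentence (view Y) are all hypothetical choices made for illustration.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Classical linear CCA via whitening and SVD (illustrative sketch).

    X: (n, p) array, view one (e.g. features of sentence i; an assumption).
    Y: (n, q) array, view two (e.g. features of sentence i+1; an assumption).
    Returns projection matrices A (p, k), B (q, k) and the top-k
    canonical correlations.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Covariance and cross-covariance; a small ridge keeps them invertible.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    A = Kx @ U[:, :k]
    B = Ky @ Vt.T[:, :k]
    return A, B, s[:k]
```

Under these assumptions, calling `cca(X, Y, k)` on paired sentence features returns directions along which the two views are maximally correlated; how such directions are turned into EDU boundaries is specific to the paper and is not reproduced here.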
Taxonomy
Topics: Semantic Web and Ontologies
