An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding
Annamalai Senthilnathan, Kristjan Arumae, Mohammed Khalilia, Zhengzheng Xing, Aaron R. Colak

TL;DR
This paper compares various efficient Transformer variants and CNN-based models for real-time long conversational understanding, finding CNNs to be faster and more memory-efficient while maintaining competitive performance.
Contribution
It provides an empirical evaluation of recent Transformer variants and CNN architectures for long sequence conversational tasks in real-time settings.
Findings
CNN models are ~2.6x faster to train
CNN inference is ~80% faster
CNN models are ~72% more memory efficient
Abstract
Analyzing long text data such as customer call transcripts is a cost-intensive and tedious task. Machine learning methods, namely Transformers, are leveraged to model agent-customer interactions. Unfortunately, Transformers adhere to fixed-length architectures and their self-attention mechanism scales quadratically with input length. Such limitations make it challenging to leverage traditional Transformers for long sequence tasks, such as conversational understanding, especially in real-time use cases. In this paper we explore and evaluate recently proposed efficient Transformer variants (e.g. Performer, Reformer) and a CNN-based architecture for real-time and near real-time long conversational understanding tasks. We show that CNN-based models are dynamic, ~2.6x faster to train, ~80% faster at inference, and ~72% more memory efficient than Transformers on average. Additionally, we…
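The quadratic scaling mentioned in the abstract can be illustrated with a rough count. The sketch below is not the paper's benchmark. It only compares how many pairwise attention scores a single self-attention head computes with how many window positions an unpadded 1D convolution visits as the input grows. The function names and the kernel size of 5 are illustrative assumptions, not values from the paper.

```python
def attention_score_entries(seq_len: int) -> int:
    """Pairwise scores in one self-attention head: one per (query, key) pair."""
    return seq_len * seq_len


def conv_window_positions(seq_len: int, kernel_size: int, stride: int = 1) -> int:
    """Kernel positions visited by an unpadded 1D convolution."""
    if seq_len < kernel_size:
        return 0
    return (seq_len - kernel_size) // stride + 1


if __name__ == "__main__":
    # kernel_size=5 is an arbitrary illustrative choice, not from the paper.
    for n in (512, 2048, 8192):
        print(n, attention_score_entries(n), conv_window_positions(n, kernel_size=5))
```

Doubling the sequence length quadruples the attention count but only roughly doubles the convolution count, which is the intuition behind treating long transcripts as a difficult case for standard Transformers.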
Taxonomy
Topics: Speech and dialogue systems
