Subword Semantic Hashing for Intent Classification on Small Datasets

Kumar Shridhar; Ayushman Dash; Amit Sahu; Gustav Grund Pihlgren; Pedro Alonso; Vinaychandran Pondenkandath; Gyorgy Kovacs; Foteini Simistira; Marcus Liwicki

arXiv:1810.07150·cs.CL·January 15, 2020

TL;DR

This paper proposes Semantic Hashing as an embedding method for intent classification. The method targets small datasets that suffer from out-of-vocabulary terms and spelling errors, and it achieves state-of-the-art results on three benchmarks.

Contribution

The paper introduces Semantic Hashing for intent classification, addressing vocabulary dependency and spelling errors, and demonstrates superior performance on multiple small datasets.

Findings

01

Achieved state-of-the-art performance on AskUbuntu, Chatbot, and Web Application datasets.

02

Semantic Hashing effectively handles out-of-vocabulary terms and spelling errors.

03

Outperforms traditional word embedding methods on small intent classification datasets.

Abstract

In this paper, we introduce the use of Semantic Hashing as embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art Deep Learning based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word embedding based methods are dependent on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when having small training datasets and using a wider vocabulary. This is the case in Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise from the use of internet communication. First, such datasets miss a lot of terms in the vocabulary to use word embeddings efficiently.…
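
The portion of the abstract shown here does not spell out the hashing step, so the following is only an illustrative sketch of why subword features can tolerate spelling errors where whole-word vocabularies cannot. It assumes character trigrams with `#` marking token boundaries; the function names `subword_trigrams` and `jaccard`, the lowercasing, the whitespace tokenization, and the Jaccard comparison are all assumptions made for this example, not the authors' confirmed pipeline.

```python
def subword_trigrams(text, n=3):
    """Split each whitespace token into character n-grams.

    Assumption for illustration: tokens are lowercased and wrapped in '#'
    so that prefixes and suffixes produce distinct n-grams.
    """
    grams = []
    for token in text.lower().split():
        padded = f"#{token}#"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard overlap between two feature collections, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


# A misspelled word shares no whole-word feature with the correct spelling,
# but it still shares most of its trigrams.
word_overlap = jaccard(["install"], ["instal"])
subword_overlap = jaccard(subword_trigrams("install"), subword_trigrams("instal"))
print(word_overlap, subword_overlap)
```

In this sketch the misspelling has zero overlap at the word level but a substantial overlap at the trigram level, which is the property that lets a downstream classifier see similar features for both spellings. The resulting trigram lists could be fed to any bag-of-features vectorizer; that choice is left open here.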
